PROTEIN DESIGN AUTOMATION FOR PROTEIN LIBRARIES

This application is a continuing application of Serial No. 09/927,790, filed on August 10, 2001, and
claims the benefit of the filing dates of Serial Nos. 60/311,545, filed on August 10, 2001; 60/324,899,
filed on September 25, 2001; 60/351,937, filed on January 25, 2002; and 60/352,103, filed on January
25, 2002.

FIELD OF THE INVENTION 

The invention relates to the use of a variety of computation methods, including protein design
automation (PDA™) technology, to generate computationally prescreened secondary libraries of
proteins, and to methods of making and methods and compositions utilizing the libraries.

BACKGROUND OF THE INVENTION 

Directed molecular evolution may be used to create proteins and enzymes with novel functions and 
properties. Starting with a known natural protein, several rounds of mutagenesis, functional 
screening, and/or selection and propagation of successful sequences are performed. The advantage 
of this process is that it may be used to rapidly evolve any protein without knowledge of its structure. 
Several different mutagenesis strategies exist, including point mutagenesis by error-prone PCR,
cassette mutagenesis, and DNA shuffling. These techniques have had many successes; however,
they are all handicapped by their inability to produce more than a tiny fraction of the potential changes
and to effectively explore all possible sequences. For example, there are 20^500 possible
amino acid changes for an average protein approximately 500 amino acids long. Clearly, the
mutagenesis and functional screening of so many mutants is impossible; directed evolution provides a 
very sparse sampling of the possible sequences and hence examines only a small portion of possible 
improved proteins, typically point mutants or recombinations of existing sequences. By sampling 
randomly from the vast number of possible sequences, directed evolution is unbiased and broadly 
applicable, but inherently inefficient because it ignores all structural and biophysical knowledge of 
proteins.

In contrast, computational methods may be used to screen enormous sequence libraries (up to or
more than 10^80 in a single calculation), overcoming the key limitation of experimental library screening
methods such as directed molecular evolution. There are a wide variety of methods known for
generating and evaluating sequences. These include, but are not limited to, sequence profiling
(Bowie and Eisenberg, Science 253(5016): 164-70 (1991)); rotamer library selections (Dahiyat and
Mayo, Protein Sci 5(5): 895-903 (1996); Dahiyat and Mayo, Science 278(5335): 82-7 (1997);
Desjarlais and Handel, Protein Science 4: 2006-2018 (1995); Harbury et al., PNAS USA 92(18):
8408-8412 (1995); Kono et al., Proteins: Structure, Function and Genetics 19: 244-255 (1994); Hellinga and
Richards, PNAS USA 91: 5803-5807 (1994)); and residue pair potentials (Jones, Protein Science 3:
567-574 (1994)) (see Altschul and Koonin, Trends Biochem Sci 23(11): 444-447 (1998); see
Altschul et al., J. Mol. Biol. 215(3): 403 (1990) and Lockless and Ranganathan, Science 286: 295-299
(1999); Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications; edited by
Jason T.L. Wang, Bruce A. Shapiro, Dennis Shasha. New York: Oxford University, 1999).
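
By way of illustration only, the scale of the sampling problem described above may be worked through numerically. The following Python sketch assumes 20 allowed amino acids at each of 500 positions; the function name and the experimental screen size of 10^9 variants are hypothetical values chosen for the example and are not taken from any of the methods cited above.

```python
from math import log10

def log10_sequence_space(n_positions, n_amino_acids=20):
    """Return log10 of the number of distinct sequences of the given length."""
    return n_positions * log10(n_amino_acids)

# 500 variable positions, 20 amino acids allowed at each one.
full_space_log10 = log10_sequence_space(500)

# Hypothetical experimental screen of 10**9 variants (assumed for illustration).
screen_log10 = 9.0
sampled_fraction_log10 = screen_log10 - full_space_log10

print(f"sequence space ~ 10^{full_space_log10:.1f}")
print(f"fraction sampled ~ 10^{sampled_fraction_log10:.1f}")
```

Under these assumptions the full space is a number with 651 decimal digits, so even a screen of a billion variants samples a vanishingly small fraction of it.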

Directed evolution is a random technique. Currently, there is no comprehensive rational design
approach that allows efficient exploration of all possible sequence space.



SUMMARY OF THE INVENTION

The present invention provides methods for generating a secondary library of scaffold protein variants
comprising providing a primary library comprising a rank-ordered list or filtered set of scaffold protein
primary variant sequences. A list of primary variant positions in the primary library is then generated,
and a plurality of the primary variant positions is then combined to generate a secondary library of
secondary sequences.
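
As a non-limiting sketch of the combination step just described, the following Python example lists the variant positions in a small primary library and recombines the residues observed at those positions. The function name, the toy sequences, and the choice to include the reference residue at every variant position are assumptions made for illustration; equal-length sequences are also assumed.

```python
from itertools import product

def secondary_library(primary, reference):
    """Combine residues seen at variant positions into all combinations."""
    # Variant positions: sites where any primary sequence differs from the reference.
    positions = [i for i in range(len(reference))
                 if any(seq[i] != reference[i] for seq in primary)]
    # Residues observed at each variant position (reference residue included).
    choices = [sorted({reference[i]} | {seq[i] for seq in primary})
               for i in positions]
    library = []
    for combo in product(*choices):
        seq = list(reference)
        for pos, aa in zip(positions, combo):
            seq[pos] = aa
        library.append("".join(seq))
    return positions, library

reference = "MKVLA"                      # toy scaffold sequence
primary = ["MKVLA", "MRVLA", "MKVIA"]    # toy rank-ordered primary list
positions, library = secondary_library(primary, reference)
```

In this toy case the secondary library contains the double variant MRVIA, which does not appear in the primary library, showing how recombination of variant positions yields sequences beyond the primary list.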



It is an object of the present invention to provide computational methods for prescreening sequence
libraries to generate and select secondary libraries, which may then be made and evaluated
experimentally.



In an additional object, the invention provides methods for generating a secondary library of scaffold
protein variants comprising providing a primary library comprising a rank-ordered list or filtered set of
scaffold protein primary variant sequences, and generating a probability distribution of amino acid
residues in a plurality of variant positions. The plurality of the amino acid residues is combined to
generate a secondary library of secondary sequences. These sequences may then be optionally
synthesized and tested, in a variety of ways, including multiplexing PCR with pooled oligonucleotides,
error prone PCR, gene shuffling, etc.

In a further object, the invention provides compositions comprising a plurality of secondary variant
proteins or nucleic acids encoding the proteins, wherein the plurality comprises all or a subset of the
secondary library. The invention further provides cells comprising the library, particularly mammalian
cells.



In an additional object, the invention provides methods for generating a secondary library of scaffold
protein variants comprising providing a first library rank-ordered list or filtered set of scaffold protein
primary variants; generating a probability distribution of amino acid residues in a plurality of variant
positions; and synthesizing a plurality of scaffold protein secondary variants comprising a plurality of
the amino acid residues to form a secondary library. At least one of the secondary variants is
different from the primary variants.
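
The probability distribution referred to above may be sketched, purely by way of example, as the fraction of primary sequences carrying each amino acid at a variant position. The function names, the toy library, and the 20% frequency cutoff below are hypothetical; the source does not prescribe a particular cutoff at this point.

```python
from collections import Counter

def position_distribution(sequences, position):
    """Fraction of sequences carrying each amino acid at one position."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def residues_above_cutoff(sequences, positions, cutoff=0.05):
    """Keep, per position, the amino acids whose frequency meets the cutoff."""
    return {pos: sorted(aa for aa, p in position_distribution(sequences, pos).items()
                        if p >= cutoff)
            for pos in positions}

# Toy primary library: K in 6 sequences, R in 3, H in 1 at position 1.
primary = ["MKVLA"] * 6 + ["MRVLA"] * 3 + ["MHVLA"] * 1
allowed = residues_above_cutoff(primary, positions=[1], cutoff=0.2)
```

Here only K and R survive the assumed cutoff at position 1, so a secondary library built from this distribution would omit the rarely observed H.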

It is a further object of the invention to provide a method for receiving a scaffold protein structure with
residue positions; selecting a collection of variable residue positions from said residue positions;
establishing a group of potential rotamers for each of said variable residue positions, wherein a
first group for a first variable residue position has a first set of rotamers from at least two
different amino acid side chains, and wherein a second group for a second variable residue position
has a second set of rotamers from at least two different amino acid side chains; and, analyzing the
interaction of each of said rotamers in each group with all or part of the remainder of said protein to
generate a set of optimized protein sequences.

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a collection of variable residue positions from said residue positions; establishing
a group of potential amino acids for each of said variable residue positions, wherein a first group for a 
first variable residue position has a first set of at least two amino acid side chains, and wherein a 
second group for a second variable residue position has a second set of at least two different amino 
acid side chains; and, analyzing the interaction of each of said amino acids with all or part of the 
remainder of said protein to generate a set of optimized protein sequences. 

It is a further object of the invention to provide a method for receiving a scaffold protein with residue 
positions; selecting a set of variable residue positions from said residue positions; establishing a 
group of potential rotamers for each of said variable residue positions; analyzing the interaction of 
each of said rotamers with all or part of the remainder of said protein to generate a set of optimized 
protein sequences, wherein said analyzing step includes the use of at least one scoring function; and, 
generating a library of said optimized protein sequences. 

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a set of variable residue positions from said residue positions; classifying each
variable residue position as either a core, surface or boundary position; establishing a group of
potential amino acids for each of said variable residue positions, wherein the group for at least one
variable residue position has at least two different amino acid side chains; and, analyzing the
interaction of each of said amino acids with all or part of the remainder of said protein to generate a 
set of optimized protein sequences, wherein said analyzing step includes the use of at least one 
scoring function. 

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a set of variable residue positions from said residue positions; establishing a
group of potential rotamers for each of said variable residue positions, wherein the group for at least
one variable residue position has rotamers of at least two different amino acid side chains, and
wherein at least one of said amino acid side chains is from a hydrophilic amino acid; and, analyzing
the interaction of each of said rotamers with all or part of the remainder of said protein to generate a 
set of optimized protein sequences, wherein said analyzing step includes the use of at least one 
scoring function. 

It is a further object of the invention to provide a computational method for receiving a scaffold protein
with residue positions; selecting a collection of variable residue positions from said residue positions;
providing a sequence alignment of a plurality of related proteins; generating a frequency of
occurrence for individual amino acids in at least a plurality of positions within said alignment; creating
a pseudo-energy scoring function using said frequencies; and using said pseudo-energy scoring function
and at least one additional scoring function to generate a set of optimized protein sequences.
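
One way such a pseudo-energy term could be formed, offered here only as an assumed sketch, is to take the negative logarithm of the observed frequency of each amino acid in an alignment column, so that frequently observed residues receive lower (more favorable) pseudo-energies. The negative-log form, the pseudocount, the weight, and all names below are assumptions for illustration and are not stated in the source.

```python
from collections import Counter
from math import log

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_energies(alignment_column, pseudocount=1.0, alphabet=ALPHABET):
    """Map amino-acid frequencies in one alignment column to -ln(frequency)."""
    counts = Counter(alignment_column)
    total = len(alignment_column) + pseudocount * len(alphabet)
    # The pseudocount keeps unobserved amino acids at a finite pseudo-energy.
    return {aa: -log((counts[aa] + pseudocount) / total) for aa in alphabet}

def combined_score(pseudo_e, additional_e, weight=1.0):
    """Add the alignment-derived term to one additional scoring term."""
    return additional_e + weight * pseudo_e

column = "LLLLIVLLMI"   # toy column from an alignment of related proteins
e = pseudo_energies(column)
```

In this toy column leucine is most common and therefore receives the lowest pseudo-energy, while amino acids never observed in the column receive the highest.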

It is a further object of the invention to provide a computational method comprising receiving a 
scaffold protein with residue positions; selecting a collection of variable residue positions from said 
residue positions; providing a sequence alignment of a plurality of related proteins; generating a 
frequency of occurrence for individual amino acids in at least a plurality of positions with said proteins;
selecting a group of potential amino acids for each of said variable residue positions, wherein a first
group for a first variable residue position has a first set of at least two amino acid side chains, and
wherein a second group for a second variable residue position has a second set of at least two
different amino acid side chains according to their frequency of occurrence; and, analyzing the
interaction of each of said amino acids at each variable residue position with all or part of the
remainder of said protein using at least one scoring function to generate a set of optimized protein 
sequences. 

It is a further object of the invention to provide a computational method for receiving a scaffold
protein with residue positions; selecting a collection of variable residue positions from said residue
positions; providing an amino acid substitution matrix; creating a pseudo-energy scoring function
using said matrix; and using said pseudo-energy scoring function and at least one additional scoring
function to generate a set of optimized protein sequences.

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a collection of at least one variable residue position from said residue positions;
importing a set of coordinates for a scaffold protein, said scaffold protein comprising amino acid
positions; analyzing the interaction of each of said amino acids with all or part of the remainder of said
protein; utilizing a plurality of scoring functions, at least a first scoring function having a first weight
and a second scoring function having a second weight, to generate at least one variable decoy 
sequence; and, comparing the scores from said scoring functions of said variable decoy sequence to 
the scores of a reference state to generate modified weights, wherein each weight is increased if the 
corresponding score of the decoy is higher than the corresponding score of the reference state and 
each weight is decreased if the corresponding score of the decoy is lower than the corresponding 
score of the reference state, and wherein the extent of increase or decrease is based on the relative
individual and total scores of the decoy and reference states. 
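
The weight adjustment described above may be sketched as follows, with the caveat that the particular update rule is an assumption. In this hypothetical example each weight is scaled up when the decoy scores higher than the reference state on the corresponding term and scaled down when it scores lower, with the size of the change set by that term's share of the total score gap. The function name, the `rate` parameter, and the toy scores are illustrative only.

```python
def update_weights(weights, decoy_scores, reference_scores, rate=0.1):
    """Nudge each weight up or down by comparing decoy and reference term scores."""
    total_gap = sum(abs(d - r) for d, r in zip(decoy_scores, reference_scores))
    if total_gap == 0:
        return list(weights)          # decoy and reference agree on every term
    new_weights = []
    for w, d, r in zip(weights, decoy_scores, reference_scores):
        # Positive share: the decoy scores higher than the reference on this term.
        share = (d - r) / total_gap
        new_weights.append(w * (1.0 + rate * share))
    return new_weights

weights = [1.0, 1.0]
decoy = [-3.0, -8.0]        # per-term scores of a decoy (toy numbers)
reference = [-5.0, -6.0]    # per-term scores of the reference state
new = update_weights(weights, decoy, reference)
```

With these toy numbers the first weight rises to 1.05 and the second falls to 0.95, reflecting the direction of each individual score difference relative to the total.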

It is a further object of the invention to provide a method for receiving a scaffold protein with residue 
positions; selecting a collection of variable residue positions from said residue positions; importing a 
set of coordinates for a scaffold protein, said scaffold protein comprising amino acid positions; 
generating a variable protein sequence comprising a defined energy state for each amino acid
position; applying an energy increase to at least one of said defined energy states for at least one of
said amino acid positions; and, generating at least one alternate variable protein sequence. 

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a collection of variable residue positions from said residue positions; importing a
set of coordinates for a scaffold protein, said scaffold protein comprising amino acid positions;
generating a variable protein sequence comprising a defined energy state for each amino acid
position; applying a probability parameter to at least one of said amino acid positions; and generating 
at least one alternate variable protein sequence. 

It is a further object of the invention to provide a method for receiving a scaffold protein with residue
positions; selecting a collection of variable residue positions from said residue positions; importing a 
set of coordinates for a scaffold protein, said scaffold protein comprising amino acid positions; 
generating a set of optimized variant protein sequences comprising one or more variant amino acids; 
and, applying a clustering algorithm to cluster said set into a plurality of subsets. 

It is a further object of the invention to provide a method for receiving at least one scaffold protein 
structure with variable residue positions of a target protein; computationally generating a set of
primary variant amino acid sequences that adopt a conformation similar to the conformation of said
target protein; and, identifying at least one protein sequence that is similar to at least one member of 
said set of primary variants, but is dissimilar to said target protein amino acid sequence. 

It is an additional object of the invention to provide a method for generating variant protein sequence 
libraries comprising providing populations of at least two double stranded donor fragments 
corresponding to a nucleic acid template; adding polymerase primers capable of hybridizing to end 
regions of each of said population of donor fragments; generating a population of hybrid double 
stranded molecules wherein one strand comprises a 5'-purification tag and the other strand comprises 
a 5'-phosphorylated overhang; enriching for variant strands by removing strands comprising a 5'-biotin 
moiety; annealing said variant strands to form at least two double stranded ligation substrates; and,
ligating said ligation substrates to form a double stranded ligation product wherein said ligation 
product encodes a variant protein. 

These and other objects of the invention are to provide computational protein design and optimization 
techniques via an objective, quantitative design technique implemented in connection with a general 
purpose computer. 

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 depicts a gene assembly scheme. 

Figure 2 illustrates that most protein design simulations do not sufficiently map sequence space. As
shown in the upper graph, most protein design simulations only map the lowest energy basin, thereby
omitting other low energy basins that could provide viable sequences for computationally generated
protein sequences. 

Figure 3 illustrates the point that the alternate low energy basins can represent equally good 
sequences for incorporation into a protein template. This is because the force field representation of 
the energy (i.e., Ecalc) is not necessarily identical to the actual energy (i.e., Etrue) associated with a
native protein structure. 

Figure 4 illustrates the application of taboo for mapping sequence space. The calculated energy
surface is manipulated based on previous solutions to discourage repeated convergence to the same 
local minimum. 
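
The manipulation of the energy surface described for Figure 4 can be illustrated with a minimal sketch. It assumes a toy energy table and uses exhaustive enumeration as a stand-in for whatever optimizer is actually employed; the penalty size, function names, and two-position example are hypothetical. After each solution is found, the states it used are penalized so that the next round converges elsewhere.

```python
from itertools import product

def best_sequence(energy, choices, penalties):
    """Exhaustively find the lowest-scoring sequence on the penalized surface."""
    def score(seq):
        return energy(seq) + sum(penalties.get((i, aa), 0.0)
                                 for i, aa in enumerate(seq))
    return min(("".join(c) for c in product(*choices)), key=score)

def taboo_search(energy, choices, rounds=3, penalty=2.0):
    """Collect distinct low-energy solutions by penalizing earlier ones."""
    penalties, found = {}, []
    for _ in range(rounds):
        seq = best_sequence(energy, choices, penalties)
        found.append(seq)
        for i, aa in enumerate(seq):   # raise the energy of states already used
            penalties[(i, aa)] = penalties.get((i, aa), 0.0) + penalty
    return found

# Toy energy with two basins: "LL" (lowest) and "KE" (slightly higher).
toy = {"LL": -10.0, "KE": -9.0, "LE": -3.0, "KL": -2.0}
solutions = taboo_search(lambda s: toy[s], choices=["LK", "LE"], rounds=2)
```

In this toy case the first round returns the global minimum LL, and the penalty then steers the second round to the alternate basin KE rather than back to LL.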

Figure 5 illustrates clustering algorithms that may be used in the methods of the present invention. 



Figure 6 depicts an example of energy matrix clustering of designed WW domain proteins using a
single linkage clustering algorithm.

Figure 7 depicts the data used to generate Figure 6.

Figure 8 depicts representative structures from clusters 1, 3, and 9.

Figure 9 depicts an example of energy matrix clustering of designed SH3 proteins. 

Figure 10 depicts the superfamily of sequences designed for SH3. As shown in Figure 6, the virtual 
superfamily of sequences designed using an SH3 backbone structure has significant homology to
the template sequence and other members of the natural SH3 family. Identities with the native 
sequence are highlighted in dark grey. Functional positions are shaded in light grey. Note that 
although the simulations did not include a functional constraint, the native functional residue usually 
appears with low frequency in the alignment.

Figure 11 illustrates coupling patterns in SH3 subfamilies. Interaction-based clustering reveals a 
series of virtual sequence subfamilies that contain various combinations of coupled amino acids 
(highlighted in different shades of grey). Note that some subfamilies differ by amino acids coupled at 7
positions (medium intensity shading). The amino acid couplings lead to multiple low energy solutions
in different sequence subspaces. As a result, some subfamilies have more similarity to the wild type 
sequence than others. 

Figure 12 depicts the synthesis of a full-length gene and all possible mutations by PCR. Overlapping
oligonucleotides corresponding to the full-length gene (black bar, Step 1) are synthesized, heated and
annealed. Addition of Pfu DNA polymerase to the annealed oligonucleotides results in the 5' to 3'
synthesis of DNA (Step 2) to produce longer DNA fragments (Step 3). Repeated cycles of heating and
annealing (Step 4) result in the production of longer DNA, including some full-length molecules.
These may be selected by a second round of PCR using primers (arrowed) corresponding to the end
of the full-length gene (Step 5).



Figure 13 depicts the reduction of the dimensionality of sequence space by PDA™ technology
screening. From left to right, 1: without PDA™ technology; 2: without PDA™ technology, not counting
Cysteine, Proline, Glycine; 3: with PDA™ technology using the 1% criterion, modeling free enzyme; 4:
with PDA™ technology using the 1% criterion, modeling enzyme-substrate complex; 5: with PDA™
technology using the 5% criterion, modeling free enzyme; 6: with PDA™ technology using the 5%
criterion, modeling enzyme-substrate complex.

Figure 14 depicts the active site of B. circulans xylanase. Those positions included in the PDA™
technology design are shown by their side chain representation.

Figure 15 depicts cefotaxime resistance of E. coli expressing wild-type (WT) and PDA™ technology.

Figure 16 depicts a preferred scheme for synthesizing a library of the invention. The wild-type gene,
or any starting gene, such as the gene for the global minima gene, may be used. Oligonucleotides
comprising different amino acids at the different variant positions may be used during PCR using
standard primers. This generally requires fewer oligonucleotides and may result in fewer errors. 

Figure 17 depicts an overlapping extension method. At the top of the figure is the template DNA
showing the locations of the regions to be mutated (black boxes) and the binding sites of the relevant
primers (arrows). The primers R1 and R2 represent a pool of primers, each containing a different
mutation; as described herein, this may be done using different ratios of primers if desired. The
variant position is flanked by regions of homology sufficient to get hybridization. In this example,
three separate PCR reactions are done for Step 1. The first reaction contains the template plus oligos
F1 and R1. The second reaction contains template plus F2 and R2, and the third contains the
template and F3 and R3. The reaction products are shown. In Step 2, the products from Step 1, tube
1 and Step 1, tube 2 are taken. After purification away from the primers, these are added to a fresh
PCR reaction together with F1 and R4. During the denaturation phase of the PCR, the overlapping
regions anneal and the second strand is synthesized. The product is then amplified by the outside
primers. In Step 3, the purified product from Step 2 is used in a third PCR reaction, together with the
product of Step 1, tube 3 and the primers F1 and R3. The final product corresponds to the full-length
gene and contains the required mutations.



Figure 18 depicts a ligation of PCR reaction products to synthesize the libraries of the invention. In
this technique, the primers also contain an endonuclease restriction site (RE), either blunt, 5'
overhanging or 3' overhanging. We set up three separate PCR reactions for Step 1. The first
reaction contains the template plus oligos F1 and R1. The second reaction contains template plus F2
and R2, and the third contains the template and F3 and R3. The reaction products are shown. In
Step 2, the products of Step 1 are purified and then digested with the appropriate restriction
endonuclease. The digestion products from Step 2, tube 1 and Step 2, tube 2 are ligated
together with DNA ligase (Step 3). The products are then amplified in Step 4 using primers F1 and R4.
The whole process is then repeated by digesting the amplified products, ligating them to the digested
products of Step 2, tube 3, and then amplifying the final product by primers F1 and R3. It would also
be possible to ligate all three PCR products from Step 1 together in one reaction, providing the two
restriction sites (RE1 and RE2) were different.

Figure 19 depicts blunt end ligation of PCR products. In this technique, the primers such as F1 and
R1 do not overlap, but they abut. Again three separate PCR reactions are performed. The products
from tube 1 and tube 2 are ligated, and then amplified with outside primers F1 and R4. This product 
is then ligated with the product from Step 1, tube 3. The final products are then amplified with primers 
F1 and R3. 

Figures 20A and B depict M13 single stranded template production of mutated PCR products.
Primer1 and Primer2 (each representing a pool of primers corresponding to desired mutations) are
mixed with the M13 template containing the wild type gene or any starting gene. PCR produces the
desired product (11) containing the combinations of the desired mutations incorporated in Primer1
and Primer2. This scheme may be used to produce a gene with mutations, or fragments of a gene
with mutations that are then linked together via ligation or PCR, for example.

Figures 21A-E depict examples of some preferred combinations.

DETAILED DESCRIPTION OF THE INVENTION 

As used herein, the following terms shall have the meaning as described below. 

By "altered phenotype" or "changed physiology" or other grammatical equivalents herein is meant that
the phenotype of the cell containing a variable amino acid sequence (preferably an optimized
sequence) is altered in some way, preferably in some detectable, observable and/or measurable way.
Examples of phenotypic changes include, but are not limited to: gross physical changes such as
changes in cell morphology, cell growth, cell viability, adhesion to substrates or other cells, and
cellular density; changes in the expression of one or more RNAs, proteins, lipids, hormones,
cytokines, or other molecules; changes in the equilibrium state (i.e. half-life) of one or more RNAs,
proteins, lipids, hormones, cytokines, or other molecules; changes in the localization of one or more
RNAs, proteins, lipids, hormones, cytokines, or other molecules; changes in the bioactivity or specific
activity of one or more RNAs, proteins, lipids, hormones, cytokines, receptors, or other molecules;
changes in phosphorylation; changes in the secretion of ions, cytokines, hormones, growth factors, or
other molecules; alterations in cellular membrane potential, polarization, integrity or transport;
changes in infectivity, susceptibility, latency, adhesion, and uptake of viruses and bacterial pathogens;
etc. By "capable of altering the phenotype" herein is meant that the library member (e.g. the variable
amino acid sequence and/or the variable nucleic acid sequence) may change the phenotype of the
cell in some detectable and/or measurable way.

By "alternate amino acid" as used herein is meant an amino acid state that differs from the amino acid
defined by the starting amino acid sequence in the protein design cycle. As outlined below, this
starting amino acid sequence (e.g. the scaffold protein) may be a wild-type sequence or a variant
sequence.

By "amino acid identity" as used herein is meant the identity of an amino acid at a specified position;
e.g. when the position of an amino acid is specified, which one of the 20 naturally occurring amino acids or
non-natural analogs is present at that position.

By "boundary residues" as used herein is meant residue positions that are not clearly in the protein
core or on the protein surface. Methods for determining boundary residues are outlined below. The
solvent accessibility of side chains in boundary positions is determined by the conformation and
identities of the residues surrounding them. In a preferred embodiment, both hydrophobic and polar
amino acids can be considered as possible replacement residues at boundary positions.

By "candidate bioactive agent" or "candidate drugs" or grammatical equivalents herein is meant any
molecule, e.g. proteins (which herein includes proteins, polypeptides, and peptides), small organic or
inorganic molecules, polysaccharides, polynucleotides, etc. which are to be tested against a particular
target. Candidate agents encompass numerous chemical classes. In a preferred embodiment, the
candidate agents are organic molecules, particularly small organic molecules, comprising functional
groups necessary for structural interaction with proteins, particularly hydrogen bonding, and typically
include at least an amine, carbonyl, hydroxyl or carboxyl group, preferably at least two of the
functional chemical groups. The candidate agents often comprise cyclical carbon or heterocyclic
structures and/or aromatic or polyaromatic structures substituted with one or more chemical functional
groups. A preferred embodiment is a protein where the uses include therapeutic, veterinary,
agricultural, and industrial applications.

By a "cellular library" herein is meant a plurality of cells wherein generally each cell within the library
contains at least one member of the library. Ideally each cell contains a single and different library
member, although as will be appreciated by those in the art, some cells within the library may not
contain a library member and some may contain more than one library member. When methods
other than retroviral infection are used to introduce the library members into a plurality of cells, the
distribution of library members within the individual cell members of the cellular library may vary
widely, as it is generally difficult to control the number of nucleic acids which enter a cell during
electroporation and other transformation methods. Suitable cell types for cellular libraries are included
below. In addition, as will be appreciated by those in the art, a cellular library generally includes a
single cell type, although in some embodiments, a cellular library may contain two or more cell types.

By "chemically modified" as used herein is meant to include modification via chemical reactions as
well as enzymatic reactions. The substrates in these reactions generally include, but are not limited
to, alkyl groups (including but not limited to straight and branched alkanes, alkenes, and alkynes), aryl
groups (including but not limited to arenes and heteroaryl), alcohols, ethers, amines, aldehydes,
ketones, carboxylic acids, esters, amides, heterocyclic compounds (including, but not limited to,
piperidines, pyrrolidines, purines, pyrimidines, benzodiazepines, and carbohydrates), steroids
(including but not limited to estrogens, androgens, cortisone, ecdysone, etc.), secondary metabolites
(including, but not limited to, terpenoids, alkaloids, polyketides, beta-lactams, polyether antibiotics,
and aminoglycosides), organometallic compounds, lipids, amino acids, and nucleosides. The
reactions generally include, but are not limited to, hydrolysis, reduction, oxidation, alkylation, aromatic
substitutions, electrocyclizations, dipolar cyclizations, radical anion, radical cation, metal mediated
couplings, and polymerization.

By "clustering algorithm" herein is meant an algorithm that may be used to separate a large selection
or set of computationally generated sequences into subsets that represent various sub-regions of 
sequence space. Clustering algorithms are well known in the art, and representative examples are 
outlined below. 
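
As one hedged illustration of such an algorithm, the sketch below applies single linkage clustering to a handful of toy sequences. Single linkage is named elsewhere in this document in connection with Figure 6; the use of Hamming distance between sequences, the threshold value, and the function names are assumptions made for the example, and any other distance (for instance one derived from energy matrices) could be substituted.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def single_linkage(sequences, threshold, distance=hamming):
    """Merge clusters whenever any cross-cluster pair is within the threshold."""
    clusters = [[s] for s in sequences]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(a, b) <= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break   # restart the scan after every merge
    return clusters

designed = ["LLVA", "LLVG", "LIVG", "KEDS", "KEDT"]   # toy designed sequences
clusters = single_linkage(designed, threshold=1)
```

The toy set separates into two subsets, one hydrophobic and one polar, each representing a different sub-region of sequence space. Note that LLVA and LIVG differ at two positions but are still joined through LLVG, which is the characteristic chaining behavior of single linkage.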

By "control sequences" or "regulatory sequences" as used herein refers to DNA sequences necessary 
for the expression of a gene in a particular host organism. The control sequences that are suitable for 
prokaryotes, for example, include a promoter, optionally an operator sequence, and a ribosome 
binding site. Eukaryotic cells utilize control sequences including, but not limited to, promoters, 
polyadenylation signals, and enhancers. 

By "core positions" as used herein is meant positions that are in the interior of a protein or which are
inaccessible or nearly inaccessible to solvent. Methods for determining which positions comprise core
positions are outlined below. As more fully outlined below, in a preferred embodiment, for design
purposes, only hydrophobic amino acids are considered for incorporation into variable positions at
core variable positions. As more fully outlined below, in an alternate preferred embodiment, polar
amino acids are considered at core positions only if they form favorable electrostatic or hydrogen 
bond interactions with other polar groups, or if disruption of the scaffold is desired. 

By "coupling" as used herein is meant the non-additive contribution (e.g. synergistic) of two or more
amino acids to an interaction involving said amino acids. Coupling can be positive (the interaction is
more favorable than the sum of the individual contributions), neutral, or negative (the interaction is
less favorable than the sum of the individual contributions). Such coupling typically occurs for amino
acids located very close in space.
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^ " decoy state,' 'decoy stnjcture.° or °decov seatjen«>° as used herein is meant a protein sequence 
and structure that is different from a specified reference stale, and that serves as a comparison state 
for use in various paranreter optimization methods. Decoy structures are more fully described betow. 

By "donor fragment" or "donor nucleic acid fragment" as used herein is meant nucleic acid fragments 
generated from or corresponding to a template nucleic acid molecule. Preferably, the donor 
fragments are generated using modified primers and a polymerase, although fragments may be 
generated using enzymatic, chemical or physical cleavage (e.g. shearing) of template nucleic acid 
molecules. Any DNA/RNA polymerase is suitable; however, thermophilic polymerases are preferred. 

An "energy matrix" is defined for the present purposes as follows. A protein design cycle simulation is 
performed to yield a single protein sequence/structure. In the context of this state, all amino acids (in 
all rotamer states) are sampled at each position or at each variable position. Alternatively, less than 
all rotamer states, or less than all amino acids, are sampled at some or all of the positions. Suitable 
sampling techniques to generate the energies are outlined herein. The context-dependent energy of 
each amino acid is stored. An energy matrix is defined by the listing of the context-dependent energy 
of each amino acid at each position of the structure. The similarity of two energy matrices (from two 
different simulations) may be defined as the root-mean-squared deviation of the two energy matrices. It 
should be noted that in some cases, energy matrices comprising less than all of the possible 
interactions can be constructed. 
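
The similarity measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout keyed by (position, amino acid), the function name, and the choice to compare only entries present in both matrices (to accommodate matrices comprising less than all possible entries) are hypothetical and are not prescribed by this document.

```python
import math

# Assumed layout: (position, one-letter amino acid) -> context-dependent energy.
EnergyMatrix = dict[tuple[int, str], float]


def energy_matrix_rmsd(m1: EnergyMatrix, m2: EnergyMatrix) -> float:
    """Root-mean-squared deviation over the entries present in both matrices."""
    shared = m1.keys() & m2.keys()
    if not shared:
        raise ValueError("the two energy matrices share no entries")
    squared = sum((m1[key] - m2[key]) ** 2 for key in shared)
    return math.sqrt(squared / len(shared))


# Hypothetical energies from two simulations (arbitrary units).
sim_a = {(1, "A"): 0.0, (1, "L"): -2.0, (2, "A"): 1.0}
sim_b = {(1, "A"): 0.0, (1, "L"): -1.0, (2, "A"): 3.0, (2, "V"): 5.0}
rmsd = energy_matrix_rmsd(sim_a, sim_b)
```

A value of zero indicates that the two simulations assign identical energies to every shared entry; larger values indicate less similar matrices.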



By "filtered set" herein is meant the optimized protein sequences that are generated using some sort 
of selection criteria. In some cases, however, the set may comprise an arbitrary or random selection 
of a subset of the primary sequences. In a preferred embodiment, the filtered set comprises a rank 
ordered list of sequences. As outlined herein, this may be done in a variety of ways, including an 
arbitrary cutoff (for example, the top 10,000 sequences are chosen, or the top 1000 and the bottom 
1000), an energy limitation (e.g. anything with a total energy calculation below X), or when a certain 
number of residue positions have been varied (e.g. the set is complete when 10 variable positions are 
achieved, etc.). As is outlined more fully below, filtering can be used as all or part of the primary, 
secondary, tertiary, etc. library generation; that is, filtering can be the sole computational analysis or 
part of a larger analysis, at one or more of the steps of the invention. For example, a primary library 
may be computationally generated using PDA, and a filtering step applied to define the set for 
secondary library generation, etc. 
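
Two of the selection criteria named above, a rank-ordered cutoff and an energy limitation, can be sketched as follows. The sketch assumes that each sequence is paired with a single total energy and that lower energies are more favorable; the function names and the toy scores are hypothetical.

```python
Scored = tuple[str, float]  # (sequence, total energy); lower is assumed better


def filter_top_n(scored: list[Scored], n: int) -> list[Scored]:
    """Rank-order by energy and keep the n most favorable sequences."""
    return sorted(scored, key=lambda pair: pair[1])[:n]


def filter_by_energy(scored: list[Scored], cutoff: float) -> list[Scored]:
    """Keep every sequence whose total energy is below the cutoff, rank-ordered."""
    return [pair for pair in sorted(scored, key=lambda pair: pair[1])
            if pair[1] < cutoff]


# Hypothetical scored primary sequences.
primary = [("AAA", -1.0), ("AAL", -3.5), ("ALL", 0.5), ("LLL", -2.0)]
top_two = filter_top_n(primary, 2)
below_cutoff = filter_by_energy(primary, cutoff=-1.5)
```

Either function returns a rank-ordered list, so the output can serve directly as a filtered set in the sense defined above.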

By "fixed position" herein is meant residue positions at which the amino acid identity will be held 
constant in a protein design calculation. In some embodiments, fixed positions may be floated, as 
defined below. That is, in some embodiments, an amino acid identity is kept fixed, but its rotameric 
state is allowed to change. In other embodiments, the amino acid identity and rotameric state are 
held constant. The conformation and amino acid identity may be that observed in the scaffold 
structure, or the conformation and/or amino acid identity may be different than that observed in the 
scaffold structure. 

By "floated position" herein is meant a position at which the amino acid conformation but not the 
amino acid identity is allowed to vary in a protein design calculation. The floated position may be 
fixed as a non-wild type residue. For example, when known site-directed mutagenesis techniques 
have shown that a particular amino acid is desirable (for example, to eliminate a proteolytic site or 
alter the substrate specificity of an enzyme), the position may be constrained to allow only that amino 
acid. Alternatively, the methods of the present invention may be used to evaluate specific mutations 
de novo. 

By "gene assembly procedures" as used herein is meant either enzymatic or chemical methods of 
joining gene fragments. A wide variety of exemplary methods are included herein and described 
below. 

By "global optimum protein sequence" as used herein is meant an amino acid sequence that best fits 
the mathematical equations of the computational process. As will be appreciated by those in the art, 
a global optimum sequence is the sequence that has the lowest energy or best score of any possible 
sequence in the context of the particular computational analysis utilized. That is, the global optimum 
sequence depends on the scoring or ranking systems used, and may change with different 
computational parameters. For example, when PDA™ is used, the global optimum will depend on the 
scoring functions utilized, the weighting factors, etc. In addition, there are any number of sequences 
that are not the global minimum but that have low energies or favorable scores, referred to herein as 
"optimized sequences", defined below. 

By "labeled" herein is meant that nucleic acids, proteins, candidate agents, antibodies or other 
components of the invention have at least one element, isotope, or chemical compound attached to 
enable the detection of nucleic acids, proteins and antibodies of the invention. 

By "ligation product" as used herein is meant either the single stranded or double stranded nucleic 
acid molecule resulting when at least two ligation substrates are ligated together. 

By "ligation substrate" as used herein is meant either a single or double stranded nucleic acid 
molecule formed by annealing from two complementary donor fragments in which one donor fragment 
has a 5'-phosphorylated overhang and the other fragment has a free 3'-terminus (see Figure 1). 

By "nucleic acid template" herein is meant a single or double stranded nucleic acid. In a preferred 
embodiment, the nucleic acid template is used to generate donor fragments, defined above. The 
donor fragments may be obtained directly from the nucleic acid template or separately obtained, e.g., 
by nucleic acid synthesis, fragmentation (e.g. enzymatic, chemical or physical) or amplification 
reactions. A nucleic acid template may comprise an intact gene, or a fragment of a gene encoding 
functional domains of a protein, such as enzymatic domains, regulatory sequences, binding domains, 
etc., as well as smaller gene fragments. The template nucleic acid may be from any organism, either 
prokaryotic or eukaryotic. The template sequence may be naturally occurring, a variant, a product of 
a computational step, etc. 

By "nucleoside" as used herein is meant nucleotides, nucleosides and analogs, including modified 
nucleosides such as amino modified nucleosides, and including non-naturally occurring analog 
structures; for example, the individual units of a peptide nucleic acid, each containing a base, are 
referred to herein as a nucleoside. 

By "operably linked" as used herein is meant two or more nucleic acids linked together such that the 
desired functionality is achieved. That is, nucleic acids are operably linked when a first nucleic acid 
sequence is placed into a functional relationship with another nucleic acid sequence. For example, 
DNA for a presequence or secretory leader is operably linked to DNA for a polypeptide if it is 
expressed as a preprotein that participates in the secretion of the polypeptide; a promoter or enhancer 
is operably linked to a coding sequence if it affects the transcription of the sequence; or a ribosome 
binding site is operably linked to a coding sequence if it is positioned so as to facilitate translation. 
Generally, operably linked DNA sequences are contiguous, and in the case of a secretory leader, 
contiguous and in reading phase. However, enhancers do not have to be contiguous. Linking can be 
accomplished by ligation at convenient restriction sites. If such sites do not exist, synthetic 
oligonucleotide adaptors or linkers are used in accordance with conventional practice. 

By "optimized protein sequence" as used herein is meant a sequence with at least one optimized 
property. For example, in the context of a particular computational analysis, an optimized sequence 
will exhibit a low energy or favorable score. For example, when PDA™ is used, an optimized 
sequence is one which has a lower energy than the energy of the starting scaffold protein. 
Alternatively, an optimized protein sequence may have one or more protein properties, defined below, 
that are desirably different as compared to the starting scaffold protein. An optimized protein 
sequence may or may not be the global optimum sequence; however, an optimized protein sequence 
has at least one amino acid substitution, insertion or deletion as compared to the starting scaffold 
protein used to generate the optimized sequence. 

By a "plurality of cells" herein is meant roughly from about 10^ cells to 10^, 10® or 10®, with from 10® to 
10° being preferred. 

By "position" as used herein is meant a location in the sequence of a protein. Positions are typically 
numbered using the protein numbering scheme described below. In the context of a given scaffold 
protein, each position is associated with the location and/or orientation of its associated backbone 
atoms in three dimensions. Consequently, positions may be described by their secondary structure 
and by whether an amino acid located at that position would be solvent exposed or buried in the 
protein core. 

By "presentation scaffold" or "presentation structure" as used herein is meant a protein structure that 
allows the scaffold protein, generally a peptide, to take on a certain conformation. For example, there 
are a wide variety of "ministructures" known, sometimes referred to as "presentation structures", that 
can confer conformational stability or give a random sequence a conformationally restricted form. 
Proteins interact with each other largely through conformationally constrained domains. Although 
small peptides with freely rotating amino and carboxyl termini can have potent functions as is known 
in the art, the conversion of such peptide structures into pharmacologic agents is difficult due to the 
inability to predict side-chain positions for peptidomimetic synthesis. Therefore the presentation of 
peptides in conformationally constrained structures will benefit the later generation of 
pharmaceuticals and will also likely lead to higher affinity interactions of the peptide with the target 
protein. This fact has been recognized in the combinatorial library generation systems using 
biologically generated short peptides in bacterial phage systems. A number of workers have 
constructed small domain molecules in which one might present randomized peptide structures. 
Thus, synthetic presentation structures, i.e. artificial polypeptides, are capable of presenting a 
randomized peptide as a conformationally-restricted domain. In addition, random peptide structures 
that are not totally random, i.e., that are selected or filtered as described herein, may be presented. 
Preferred presentation structures maximize accessibility to the peptide by presenting it on an exterior 
loop. Accordingly, suitable presentation structures include, but are not limited to, minibody structures, 
loops on beta-sheet turns and coiled-coil stem structures in which residues not critical to structure are 
randomized, zinc-finger domains, cysteine-linked (disulfide) structures, transglutaminase linked 
structures, cyclic peptides, B-loop structures, helical barrels or bundles, leucine zipper motifs, etc. 

By "primary library" as used herein is meant a collection of sequences, preferably optimized and 
generally, but not always, in the form of a filtered set, a rank-ordered list (e.g. a scored or sampled 
set), an alignment, a probability distribution table, etc. A primary library is generated as a targeted 
subset of all or a portion of the sequence space for a particular scaffold protein. That is, a primary 
library is generated using any number of techniques, either alone or in combination, to reduce the size 
of the set of sequences likely to take on a particular fold or have a particular protein property. The 
primary library preferably comprises a set of sequences resulting from computation, which may 
include energy calculations and/or statistical or knowledge based approaches. In general, it is 
preferable to have the primary library be large enough to randomly sample a reasonable sequence 
space to allow for robust secondary libraries. Thus, primary libraries that range from about 50 to 
about 10" are preferred, with from about 1000 to about 10^ being particularly preferred, and from 
about 1000 to about 100,000 being especially preferred. 

By "probability parameter" as used herein is meant a parameter that governs the rate at which a given 
amino acid or rotamer state is sampled during a simulation. 

By "protein" as used herein is meant at least two amino acids linked together by a peptide bond. As 
used herein, protein includes proteins, oligopeptides, polypeptides and peptides. The peptidyl group 
may comprise naturally occurring amino acids and peptide bonds, or synthetic peptidomimetic 
structures, i.e. "analogs", such as peptoids (see Simon et al., PNAS USA 89(20):9367 (1992)). The 
amino acids may either be naturally occurring or non-naturally occurring. The side chains may be in 
either the (R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or 
L-configuration. 

By "protein numbering scheme" herein is meant the manner in which, as is known in the art, the 
residues, or positions, of proteins are generally numbered. The residues, or positions, are generally 
sequentially numbered starting with the N-terminus of the protein. Thus a protein having a methionine 
at its N-terminus is said to have a methionine at residue or amino acid position 1, with the next 
residues as 2, 3, 4, etc. In some embodiments, a set of aligned proteins is numbered together. In 
such cases, insertions relative to the consensus sequence are denoted by adding a letter after the 
number; for example, a one-residue insertion between positions 1 and 3 would produce the 
numbering 1, 2a, 2b, 3. Similarly, deletions relative to the consensus sequence are denoted by 
skipping a number; for example, a one-residue deletion between positions 1 and 3 would produce the 
numbering 1, 3. 
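
The aligned numbering convention can be sketched as follows, reproducing the two examples given in the definition. The gapped-string representation of the alignment, the gap character and the function name are assumptions made for illustration; insertions that precede the first consensus position, and runs of more than 26 residues at one number, are not handled.

```python
from string import ascii_lowercase


def number_aligned(consensus: str, target: str, gap: str = "-") -> list[str]:
    """Label the residues of `target` using numbering based on `consensus`.

    Both arguments are rows of one alignment (equal length, gaps as `gap`).
    Inserted residues share the number of the preceding consensus position
    and are distinguished by letters; deleted positions are skipped.
    """
    if len(consensus) != len(target):
        raise ValueError("aligned rows must have equal length")
    groups: list[list[int]] = []  # [consensus number, target residues at that number]
    number = 0
    for cons_char, target_char in zip(consensus, target):
        if cons_char != gap:
            number += 1
            groups.append([number, 0])
        if target_char != gap:
            if not groups:
                raise ValueError("insertion before the first consensus position")
            groups[-1][1] += 1
    labels: list[str] = []
    for number, count in groups:
        if count == 1:
            labels.append(str(number))
        elif count > 1:
            labels.extend(f"{number}{ascii_lowercase[i]}" for i in range(count))
    return labels


insertion_labels = number_aligned("MA-G", "MAXG")  # one-residue insertion
deletion_labels = number_aligned("MAG", "M-G")     # one-residue deletion
```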

By "protein properties" herein is meant biological, chemical, and physical properties including, but not 
limited to, enzymatic activity, specificity (including substrate specificity, kinetic association and 
dissociation rates, reaction mechanism, and pH profile), stability (including thermal stability, stability 
as a function of pH or solution conditions, resistance or susceptibility to ubiquitination or proteolytic 
degradation), solubility, aggregation, structural integrity, the creation of new antibody CDRs, the 
generation of new DNA or RNA binding, the generation of peptide and peptidomimetic libraries, 
crystallizability, binding affinity and specificity (to one or more molecules including proteins, nucleic 
acids, polysaccharides, lipids, and small molecules), oligomerization state, dynamic properties 
(including conformational changes, allostery, correlated motions, flexibility, rigidity, folding rate), 
subcellular localization, ability to be secreted, ability to be displayed on the surface of a cell, 
posttranslational modification (including N- or C-linked glycosylation, lipidation, and phosphorylation), 
amenability to synthetic modification (including PEGylation, attachment to other molecules or 
surfaces), and ability to induce altered phenotype or changed physiology (including cytotoxic activity, 
immunogenicity, toxicity, ability to signal, ability to stimulate or inhibit cell proliferation, ability to 
induce apoptosis, and ability to treat disease). As is outlined herein, protein properties may be 
modulated using the techniques of the invention. When a biological activity is the property, modulation 
in this context includes both an increase and a decrease in activity. 

By "pseudo energy" as used herein is meant an energy-like term derived from non-energetic 
information. Such pseudo energies are typically used as a mechanism for combining non-energetic 
information with energy based scoring functions. For example, statistical information arising from 
structural analysis, sequence alignments, or simulation history may be incorporated into a calculation 
by their conversion to pseudo energies. 
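
One common way to convert statistical information into an energy-like term is a negative log-odds transformation, sketched below for amino acid counts at a single alignment column. This particular formula, the pseudocount, the scale factor and all names are assumptions made for illustration; this document does not state which conversion is used.

```python
import math


def pseudo_energies(counts: dict[str, int],
                    background: dict[str, float],
                    scale: float = 1.0,
                    pseudocount: float = 1.0) -> dict[str, float]:
    """Convert observed amino acid counts into pseudo energies.

    E(aa) = -scale * ln(observed frequency / background frequency), so amino
    acids seen more often than expected receive negative (favorable) values.
    The pseudocount keeps unobserved amino acids at a finite energy.
    """
    total = sum(counts.get(aa, 0) for aa in background) + pseudocount * len(background)
    energies = {}
    for aa, expected in background.items():
        observed = (counts.get(aa, 0) + pseudocount) / total
        energies[aa] = -scale * math.log(observed / expected)
    return energies


# Hypothetical four-letter alphabet with a uniform background.
background = {"A": 0.25, "L": 0.25, "K": 0.25, "E": 0.25}
column_counts = {"L": 6, "A": 2}
energies = pseudo_energies(column_counts, background)
```

Because the result is expressed on an energy-like scale, it can be added, with a suitable weight, to the output of an energy based scoring function.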

By "recency parameter" as used herein is meant the application of at least one restraint to the most 
recent moves of a simulation (see Modern Heuristic Search Methods, edited by V.J. Rayward-Smith, 
et al., 1996, John Wiley & Sons Ltd., hereby expressly incorporated by reference in its entirety). 

By "residue" as used herein is meant an amino acid side chain. A residue may be one of the naturally 
occurring amino acid side chains or a synthetic analog. 

By "scaffold protein" herein is meant a protein for which a library of variants is desired. The scaffold 
protein is used as input in the protein design calculations, and often is used to facilitate experimental 
library generation. A scaffold protein may be any protein that has a known structure or for which a 
structure may be calculated, estimated, modeled or determined experimentally. As outlined more fully 
below, the scaffold protein may be a wild-type protein from any organism, a variant, a chimeric 
protein, etc. Preferred embodiments of scaffold proteins are outlined below. 

By "secondary library" as used herein is meant a library of amino acid sequences that is derived from 
a primary library using a variety of approaches discussed further below, including both experimental 
and computational methods, or combinations thereof. Secondary libraries are generally generated 
experimentally and analyzed for the presence of members possessing desired protein properties. 
The secondary library may be either a subset of the primary library, or contain new library members, 
i.e. sequences that are not found in the primary library. The secondary library typically comprises at 
least one member sequence that is not found in the primary library, and preferably a plurality of such 
sequences, although this is not required. 

By "selectable gene," "selection gene" or "selectable marker" as used herein is meant any gene that 
enables survival and/or reproduction of the cells that express it. The marker gene may confer 
resistance to a selection agent such as an antibiotic, or may provide a protein required for growth. 

By "sequence space" herein is meant all sequential combinations of amino acids that are possible for 
a defined protein and a defined set of positions thereof. For example, the sequence space for all 
positions of a 100-residue protein is 20^100, and the sequence space for ten selected positions of a 
protein would be 20^10, if only the twenty naturally occurring amino acids are considered. 
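
The arithmetic behind this definition can be checked with a short sketch. The function names are hypothetical, and the second function, which allows a different number of amino acids at each position, is an illustrative extension that the definition itself does not state.

```python
import math


def sequence_space_size(n_positions: int, n_amino_acids: int = 20) -> int:
    """Number of sequences when every position may hold any amino acid."""
    return n_amino_acids ** n_positions


def restricted_space_size(allowed_per_position: list[int]) -> int:
    """Number of sequences when each position has its own allowed set size."""
    return math.prod(allowed_per_position)


ten_positions = sequence_space_size(10)    # 20^10, about 1.0 x 10^13
all_positions = sequence_space_size(100)   # 20^100, about 1.3 x 10^130
restricted = restricted_space_size([20, 20, 8])
```

Python integers have arbitrary precision, so even 20^100 is computed exactly rather than as a floating point approximation.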

By "shuffling" as used herein is meant recombination of one or more protein, DNA, or RNA sequences. 
Shuffling may be done experimentally and/or computationally (e.g. "in silico shuffling"). See, for 
example, U.S. Patent 6,319,714; WO 00/42559; WO 00/42560; and WO 00/42561. 

By "solid support" or other grammatical equivalents herein is meant any material that may be modified 
to contain discrete individual sites appropriate for the attachment or association of beads, other solid 
support surfaces not in solution, and is amenable to at least one detection method. As will be 
appreciated by those in the art, the number of possible supports is very large. Possible solid supports 
include, but are not limited to, glass and modified or functionalized glass, plastics (including acrylics, 
polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, 
polybutylene, polyurethanes, Teflon®, etc.), polysaccharides, nylon or nitrocellulose, resins, silica or 
silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses, 
plastics, optical fiber bundles, and a variety of other polymers. In general, the solid supports allow 
optical detection and do not themselves appreciably fluoresce. 

By "sticky end" as used herein is meant the end of an enzymatically cleaved DNA fragment that has 
either a 5' or 3' overhang, and has the potential to interact favorably with another sticky end with 
similar properties. 

By "surface positions" as used herein is meant amino acid positions within a scaffold protein (or a 
variable protein) with a significant degree of solvent accessibility. Methods for the determination of 
surface positions are outlined below. In a preferred embodiment, only polar amino acids are 
considered as possible replacement residues at surface positions in protein design calculations. 

By "tabu search algorithms" as used herein is meant any algorithms from the class of searching 
methods in which searching moves are made such that moves already made, or made recently in the 
history of the search, are either avoided or disfavored. 
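
A minimal member of this class of algorithms is sketched below. It assumes single-residue substitutions as moves, a fixed-length list of recently visited sequences as the tabu memory, and a toy energy function that counts mismatches to an arbitrary target; none of these choices is taken from this document.

```python
from collections import deque
from typing import Callable, Sequence


def tabu_search(start: str,
                alphabet: Sequence[str],
                energy: Callable[[str], float],
                n_steps: int = 50,
                tabu_size: int = 5) -> tuple[str, float]:
    """Move to the best non-tabu single-substitution neighbor at every step,
    even if it is worse, and return the best sequence seen overall."""
    current = start
    best, best_energy = current, energy(current)
    tabu = deque([current], maxlen=tabu_size)  # recently visited sequences
    for _ in range(n_steps):
        candidates = []
        for i, residue in enumerate(current):
            for aa in alphabet:
                if aa == residue:
                    continue
                neighbor = current[:i] + aa + current[i + 1:]
                if neighbor not in tabu:  # avoid recently made moves
                    candidates.append((energy(neighbor), neighbor))
        if not candidates:
            break
        current_energy, current = min(candidates)
        tabu.append(current)
        if current_energy < best_energy:
            best, best_energy = current, current_energy
    return best, best_energy


def mismatches(seq: str, target: str = "LIV") -> int:
    """Toy energy: number of positions that differ from a hypothetical target."""
    return sum(a != b for a, b in zip(seq, target))


best_seq, best_e = tabu_search("AAA", "AILV", mismatches, n_steps=10)
```

Because recently visited sequences are excluded, the search is forced away from a minimum once it has been found, which is why the best sequence is tracked separately from the current one.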

By "tertiary library" as used herein is meant a library that is generated by computational or 
experimental modification or manipulation of a secondary library. 



By "variant protein sequence" as used herein is meant a protein sequence that differs from another 
protein sequence. In other words, a variant protein sequence has at least one amino acid that differs 
from the amino acid defined by the starting amino acid sequence in the protein design cycle. As 
outlined below, this starting amino acid sequence (e.g. the scaffold protein) may be a wild-type 
sequence or a variant sequence. 

By "variable residue position" herein is meant a position at which both the amino acid identity and 
conformation are allowed to be altered in a protein design calculation. The amino acid identity to 
which a position may be mutated may be the full set or a subset of the 20 naturally occurring amino 
acids, or may be a set of non-naturally occurring amino acids or synthetic analogs. 

By "temperature factor" as used herein is meant a parameter in an optimization algorithm that 
determines the acceptance criteria for a sampling jump. As will be appreciated by those skilled in the 
art, high temperature factors allow searches across a broad area of sequence space, and low 
temperature factors allow searches over a narrow region of sequence space. See Metropolis et al., J. 
Chem. Phys. v21, pp 1087, 1953, hereby expressly incorporated by reference. 
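
The role of the temperature factor can be illustrated with the standard Metropolis acceptance rule from the cited reference. The sketch assumes that energies and the temperature factor share the same units (any Boltzmann constant is folded into the temperature) and that lower energies are more favorable; the function names are hypothetical.

```python
import math
import random


def acceptance_probability(delta_e: float, temperature: float) -> float:
    """Metropolis rule: downhill jumps are always accepted, uphill jumps are
    accepted with probability exp(-delta_e / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature factor must be positive")
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)


def accept_jump(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Draw a uniform random number to decide whether a sampling jump is kept."""
    return rng.random() < acceptance_probability(delta_e, temperature)


# The same unfavorable jump at a high and a low temperature factor.
p_high = acceptance_probability(2.0, temperature=10.0)
p_low = acceptance_probability(2.0, temperature=0.1)
```

At the high temperature factor the unfavorable jump is accepted most of the time, so the search ranges broadly; at the low temperature factor it is almost always rejected, confining the search to a narrow region.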

By "variant strand" as used herein is meant a nucleic acid strand generated using the gene assembly 
methods outlined herein to differ from the corresponding template nucleic acid sequence by at least 
one nucleotide or its complement. 

All references cited herein are expressly incorporated by reference. 

Introduction 

The present invention is directed to methods of using computational screening of protein sequence 
libraries (that may comprise up to 10®° or more members) to select smaller libraries of protein 
sequences (that may comprise up to 10^^ members), which may then be used in a number of ways. 
For example, the proteins may actually be synthesized and experimentally tested in the desired assay to 
identify proteins that possess desired properties. Similarly, the library may be subjected to additional 
computational manipulation in order to create a new library, which may be experimentally tested. 

As may be appreciated by those skilled in the art, a variety of user interfaces may be utilized in the 
present invention. In a preferred embodiment, the interface is designed to maximize usability and 
efficiency. Furthermore, any or all of the computational methods described below may be automated 
for increased usability and efficiency. 

Computational screening to enrich libraries with proteins possessing desired properties 

By computationally screening very large libraries of variant proteins, a greater diversity of protein 
sequences may be screened (i.e. a better sampling of sequence space) than is possible using 
experimental methods alone. Consequently, the probability of identifying proteins with desired 
properties is increased and greater improvements may be realized compared to the results of purely 
experimental methods. 



The number of possible protein sequences grows exponentially with the number of positions that are 
randomized. Generally, only up to 10" - 10" sequences may be contained in a physical library 
because of experimental and physical constraints (e.g. transformation efficiency, instrumentation 
limits, the cost of producing large numbers of biopolymers, and, for larger libraries, the number of 
carbon atoms in the universe, etc.). Often, practical considerations may limit the library size to 10* or 
fewer. These limits are reached for only 10 amino acid positions. In contrast, using the automated 
protein design techniques outlined below, virtual libraries of protein sequences that are vastly larger 
than experimental libraries may be generated and analyzed: up to 10** or more candidate sequences 
may be screened computationally. 

Using experimental methods alone, only a sparse sampling of sequences is possible in the search for 
proteins or peptides with desired properties, lowering the chance of success (both finding any proteins 
that possess the desired properties, and finding proteins that surpass the minimum acceptable 
criteria) and almost certainly missing desirable candidates. Because of the random nature of the 
mutations in experimental libraries, most of the candidates in the library are not suitable (for example, 
a large fraction of sequence space encodes unfolded, misfolded, incompletely folded, partially folded, 
or aggregated proteins), resulting in an enormous waste of the time and resources required to 
produce the library. In effect, when experimental methods alone are used, the screened library is 
composed of a large amount of "wasted sequence space". 

Computational pre-screening may be used to generate and/or enrich libraries of variant proteins that 
possess desired protein properties. An experimental library consisting of the favorable candidates 
found in the virtual library screening may then be generated, resulting in a much more efficient use of 
the time, money and effort required to construct and screen an experimental library. In effect, when 
computational pre-screening is used, the screened library is composed of primarily productive 
sequence space. As a result, computational pre-screening increases the chances of identifying one 
or more proteins that possess the desired protein properties. 

Computational pre-screening may also be beneficial when the library of mutants is sufficiently small to 
be screened experimentally (that is, a library size of less than 10^^). It reduces the number of mutants 
that must be tested experimentally, thereby reducing the cost and difficulty associated with protein 
engineering and experimental screening. 

While experimental methods are typically limited to 10^ - 10^^ sequences, computational methods 
have the unique ability to screen 10®° sequences or more. However, purely computational methods 
are limited by an incomplete knowledge of the structure-function relationship in proteins. In contrast, 
experimental methods are capable of identifying sequences with desired protein properties, even in 
cases where the causative link between sequence and observed protein properties is not understood. 
Thus, computational pre-screening followed by experimental screening of the most promising 
constructs combines the best features of computational and experimental methods. 

Computational screening for target identification 

In a preferred embodiment, the present invention finds use in the screening of random peptide 
libraries for the purpose of target identification. In this application, random peptides are screened for 
the ability to cause a phenotypic alteration. Following identification of the active peptides, their 
interaction partners, which will typically be other proteins, may be determined. These proteins are 
likely to be involved in the biochemical pathway associated with a given phenotypic alteration, and 
therefore could potentially serve as new drug targets. This approach is analogous to the chemical 
genetics methods that have been developed for small molecule libraries (Chen et al. 6:221-235 
(1999); Knockaert et al. Chem. Biol. 7(6):411-22 (2000)). 

Screening small molecule libraries for compounds that are capable of inducing specific alterations in 
cellular physiology or phenotype has led to the discovery of proteins that function in a variety of 
biochemical and signal transduction pathways. Cyclosporin A (CsA) and FK506, for example, were 
selected in standard pharmaceutical screens for inhibition of T-cell activation. It is noteworthy that 
while these two drugs bind completely different cellular proteins, cyclophilin and FK506 binding 
protein (FKBP), respectively, the effect of either drug is virtually the same: profound and specific 
suppression of T-cell activation, phenotypically observable in T cells as inhibition of mRNA production 
dependent on transcription factors such as NF-AT and NF-κB. 

Chemical genetics approaches have typically used libraries of small molecules; however, libraries of 
peptides or proteins could be used instead. Computational pre-screening of the peptide libraries 
could be used to maximize the diversity of properties in the library and to select structured peptides 
that are especially likely to bind other molecules with high affinity. 

Computational screening for fold identification 

The present invention also finds use in fold identification. Structural and functional properties of 
protein sequences, such as those deriving from various genome projects, may often be inferred from 
sequence similarity to proteins whose structural and/or functional properties have been characterized. 
One limitation of this approach is that many newly discovered sequences lack sufficient sequence 
similarity with any of the better characterized proteins. 

In a preferred embodiment, a three-dimensional database is created by modifying a known protein 
structure to incorporate particular amino acid residues required for a characteristic property or 
function, as is described in WO 00/23474, expressly incorporated herein by reference. This allows 
the creation of a database that can be used in a manner similar to other "structural alignment" 
programs. That is, by using the protein design cycle systems outlined herein, a variety of amino acid 
sequences that will take on a particular structural fold are generated. These sequences represent a 
set of artificial sequences that will take on a particular conformation. This database may be searched 
against protein databases to identify new proteins having structural similarity with the known protein. 
Thus, proteins can be identified that may take on a particular fold but do not have enough sequence 
homology to a naturally occurring protein to be chosen using known alignment programs. In some 
cases, this will allow the assignment of putative functional information as well; for example, by 
identifying proteins with structural homology to a particular class of enzyme or ligand, the new protein 
can be assigned similar function. This finds particular use in identifying proteins that have been 
sequenced but to which no structure and/or function has been assigned. 

In addition, the database could contain additional computationally generated sequences that are
predicted to be compatible with a given structure and/or function. Computationally supplemented
databases may contain a significantly greater diversity and total number of sequences than databases
that rely solely on experimental results. Consequently, the fraction of sequences that may be
classified into a protein family will be larger using a computationally supplemented database than
using a purely experimental database. Fold identification using PDA™ technology and bioinformatics
tools (e.g. dynamic programming algorithms, BLAST search) may then be used to identify new drug
targets and antidotes to biological weapons.

As an example of this concept, the sequencing of new genomes will reveal proteins, structural motifs,
and domains that are unique to certain genomes. For example, there may be some domains that are
unique to bacterial or viral genomes and do not exist in eukaryotic genomes. PDA™ technology
and/or the other computational methods outlined herein may be used to identify sequences that are
compatible with these structures. Bacterial and viral genomes may then be searched to identify
additional proteins that are likely to fold to the structures, but could not be identified as homologs
using traditional methods. The resulting proteins may serve as novel drug targets that could be used
to discover new classes of antibiotics and antiviral drugs.
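
By way of illustration only, the search of a set of computationally designed sequences against a
target sequence may be sketched as follows. The sketch assumes that the designed sequences are
already aligned and of equal length, and it substitutes a simple ungapped position-specific profile
for the dynamic programming or BLAST searches mentioned above. The function names, the pseudocount,
and the example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(designed_seqs, pseudocount=1.0):
    """Per-position log-odds scores relative to a uniform background."""
    length = len(designed_seqs[0])
    total = len(designed_seqs) + pseudocount * len(AMINO_ACIDS)
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in designed_seqs)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_window(profile, target):
    """Slide the profile along the target; return (best score, offset)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(target) - width + 1):
        score = sum(profile[i][target[start + i]] for i in range(width))
        best = max(best, (score, start))
    return best

# Hypothetical designed sequences for one fold, and a hypothetical genome-derived target.
designed = ["LKVEI", "IKVEL", "LRVDI", "VKIEL"]
profile = build_profile(designed)
hit_score, hit_offset = best_window(profile, "GGSLKVELGG")
```

A positive score at some offset suggests that the corresponding segment of the target resembles the
designed sequence set even though it is identical to none of its members.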

Approach to library generation 

The invention describes novel methods to create secondary libraries derived from very large
computational mutant libraries. These methods allow the rapid experimental and/or computational
testing of large numbers of computationally designed sequences.

As more fully outlined below, the invention may take on a wide variety of configurations. In general,
primary libraries are generated computationally. This may be done in a wide variety of ways,
including, but not limited to, sequence alignments of related proteins, structural alignments, structural
prediction models, SCMF methods, or preferably Protein Design Automation™ (PDA™) technology
computational analysis.

Once the primary library is generated, it may be manipulated in a variety of ways. In one
embodiment, a different type of computational analysis may be done; for example, a new type of
ranking may be performed. In a preferred embodiment, some subset of the primary library is then
experimentally generated to form a secondary library. Alternatively, some or all of the primary library
members are recombined to form a secondary library, resulting in a secondary library that contains
sequences not included in the primary library. Again, this may be done either computationally or
experimentally or both.
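
As a non-limiting sketch of the computational recombination just described, the residues observed
at each position of the primary library may be combined in every possible way, which yields
sequences absent from the primary library. The function name `recombine` and the four-residue
example sequences are hypothetical.

```python
from itertools import product

def recombine(primary_library):
    """Combine the residues seen at each position across the primary library."""
    per_position = [sorted(set(column)) for column in zip(*primary_library)]
    return ["".join(combo) for combo in product(*per_position)]

# Hypothetical primary library of three aligned sequences.
primary = ["ALKE", "VLKE", "ALRD"]
secondary = recombine(primary)
novel = [seq for seq in secondary if seq not in primary]
```

Here three primary sequences give a secondary library of eight members, five of which were not in
the primary library.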

Accordingly, the present invention provides computational and experimental methods for generating
secondary libraries of scaffold protein variants.

Overview of PDA™ Technology Methodology

In a preferred embodiment, the computational method used to generate the primary library is Protein
Design Automation™ (PDA™) technology, as is described in U.S.S.N.s 60/061,097, 60/043,464,
60/054,678, 09/127,926 and PCT US98/07254, all of which are expressly incorporated herein by
reference. Briefly, PDA™ technology may be described as follows. A known, generated or
homologous protein structure is used as the starting point. The residues to be optimized are then
identified, which may be the entire sequence or subset(s) thereof. The side chains of any positions to
be modified are then removed. The amino acids that will be considered at each position are selected
(for example, core residues generally will be selected from the set of hydrophobic residues, surface
residues generally will be selected from the hydrophilic residues, and boundary residues may be
either). Each amino acid residue may be represented by a discrete set of allowed conformations,
called rotamers. Interaction energies are calculated between each residue in a given rotamer and the
backbone and between each pair of residues in each of their rotamers at different positions.
Combinatorial search algorithms, typically DEE and Monte Carlo, are used to identify the optimum
amino acid sequence and additional low energy sequences which will comprise the primary library.
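
The search step outlined above may be illustrated by the following toy sketch, which is not the
PDA™ implementation. It assumes precomputed rotamer/backbone energies (`single`) and
rotamer/rotamer energies (`pair`), here filled with random numbers rather than physical energies,
and it uses only a Metropolis Monte Carlo search; dead-end elimination (DEE) is omitted. Each
rotamer index stands for one amino acid and conformation choice, and all names are hypothetical.

```python
import itertools
import math
import random

def total_energy(state, single, pair):
    """Sum rotamer/backbone terms and rotamer/rotamer terms for one assignment."""
    energy = sum(single[i][r] for i, r in enumerate(state))
    for i, j in itertools.combinations(range(len(state)), 2):
        energy += pair[(i, j)][state[i]][state[j]]
    return energy

def monte_carlo(single, pair, steps=2000, temperature=1.0, keep=5, seed=0):
    """Metropolis search; returns the `keep` lowest-energy states visited."""
    rng = random.Random(seed)
    rotamer_counts = [len(s) for s in single]
    state = [rng.randrange(n) for n in rotamer_counts]
    energy = total_energy(state, single, pair)
    visited = {tuple(state): energy}
    for _ in range(steps):
        pos = rng.randrange(len(state))
        trial = list(state)
        trial[pos] = rng.randrange(rotamer_counts[pos])
        trial_energy = total_energy(trial, single, pair)
        delta = trial_energy - energy
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, energy = trial, trial_energy
            visited[tuple(state)] = energy
    return sorted(visited.items(), key=lambda item: item[1])[:keep]

# Toy energy tables (assumed values): 3 positions, 3 rotamers each.
table_rng = random.Random(1)
n_pos, n_rot = 3, 3
single = [[table_rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
pair = {
    (i, j): [[table_rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_rot)]
    for i, j in itertools.combinations(range(n_pos), 2)
}
library = monte_carlo(single, pair)
```

The returned list plays the role of a primary library: the lowest-energy assignment found, followed
by additional low-energy assignments visited during the search.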

PDA™ technology, viewed broadly, has four components that may be varied to alter the output (i.e.
the primary library): generation of the template or templates; choice of amino acid identities and
conformations considered at each position; the scoring functions used in the process; and the
optimization strategy.

Selection and preparation of the scaffold protein

Source of Three-dimensional Structure

The scaffold protein may be any protein for which a three dimensional structure (that is, three
dimensional coordinates for each atom of the protein) is known or may be generated. The three
dimensional structures of proteins may be determined using X-ray crystallographic techniques, NMR
techniques, de novo modeling, homology modeling, etc. In general, if X-ray structures are used,
structures at 2 Å resolution or better are preferred, but not required. Suitable protein structures
include, but are not limited to, all of those found in the Protein Data Base compiled and serviced by
the Research Collaboratory for Structural Bioinformatics (RCSB, formerly the Brookhaven National
Lab).

Scope of Scaffold

The scaffold used in protein design calculations may comprise an entire protein or peptide, a subset
of a protein such as a domain (including functional domains such as enzymatic domains,
substrate-binding domains, regulatory domains, dimerization domains, etc.), motif, site, or loop. The scaffold
protein may comprise more than one protein chain. That is, the scaffold may be an oligomer
(including but not limited to dimers, trimers, hexamers, 60-mers such as viral coats, and long protein
chains such as actin filaments) or a multi-protein complex (including but not limited to ligand-receptor
pairs, antibody-antigen pairs, ribosome complexes, proteasome complexes, transcription complexes,
chaperone complexes, the spliceosome, molecular motors, focal adhesion complexes, multi-protein
signaling complexes, etc.). The scaffold may additionally contain non-protein components, including
but not limited to small molecules, substrates, cofactors, metals, water molecules, prosthetic groups,
nucleic acids such as DNA and RNA, sugars, and lipids.

Source of the scaffold protein 

The scaffold proteins may be from any organism, including prokaryotes and eukaryotes, with proteins
from bacteria, fungi, viruses, extremophiles such as the archaebacteria, insects, fish, animals
(particularly mammals and particularly human) and birds all possible. The scaffold protein does not
necessarily need to be naturally occurring; for example, the scaffold protein could be a designed
protein, or a protein selected by a variety of methods including but not limited to directed evolution
(Farinas et al., Current Opinion in Biotechnology 12:545-551 (2001); Morawski et al., Biotechnology and
Bioengineering 76:99-107 (2001); Stemmer, Nature 370(6488):389-91 (1994); Ness et al., Adv.
Protein Chem. 55:261-92 (2000)), DNA shuffling (Maxygen, Enchira, Diversa) or ribosome display
(Hanes et al., Methods in Enzymology 328:404-430 (2000); Hanes and Pluckthun, Proc. Natl. Acad.
Sci. USA 94:4937-4942 (1997); Roberts and Szostak, Proc. Natl. Acad. Sci. USA 94:12297-302
(1997)).

Examples of suitable scaffolds

As will be appreciated by those skilled in the art, any number of scaffold proteins find use in the
present invention. Suitable proteins include, but are not limited to, industrial and pharmaceutical
proteins, including ligands, cell surface receptors, antigens, antibodies, cytokines, hormones,
transcription factors, signaling modules, cytoskeletal proteins and enzymes.

Specifically, preferred scaffold proteins include, but are not limited to, those with known or predictable
structures (including variants):

• cytokines (IL-1ra (+receptor complex), IL-1 (receptor alone), IL-1a, IL-1b (including variants
and or receptor complex), IL-2, IL-3, IL-4, IL-5, IL-6, IL-8, IL-10, IFN-β, IFN-γ, IFN-α-2a,
IFN-α-2b, TNF-α; CD40 ligand (chk), Human Obesity Protein Leptin, Granulocyte
Colony-Stimulating Factor, Bone Morphogenetic Protein-7, Ciliary Neurotrophic Factor,
Granulocyte-Macrophage Colony-Stimulating Factor, Monocyte Chemoattractant Protein 1, Macrophage
Migration Inhibitory Factor, Human Glycosylation-Inhibiting Factor, Human Rantes, Human
Macrophage Inflammatory Protein 1 Beta, human growth hormone, Leukemia Inhibitory
Factor, Human Melanoma Growth Stimulatory Activity, neutrophil activating peptide-2,
Cc-Chemokine Mcp-3, Platelet Factor M2, Neutrophil Activating Peptide 2, Eotaxin, Stromal
Cell-Derived Factor-1, Insulin, Insulin-like Growth Factor I, Insulin-like Growth Factor II,
Transforming Growth Factor β1, Transforming Growth Factor β2, Transforming Growth
Factor β3, Transforming Growth Factor A, Vascular Endothelial growth factor (VEGF), acidic
Fibroblast growth factor, basic Fibroblast growth factor, Endothelial growth factor, Nerve
growth factor, Brain Derived Neurotrophic Factor, Ciliary Neurotrophic Factor, Platelet
Derived Growth Factor, Human Hepatocyte Growth Factor, Fibroblast Growth Factor
(including but not limited to alternative splice variants, abundant variants, and the like), Glial
Cell-Derived Neurotrophic Factor, and haemopoietic receptor cytokines (including but not
limited to erythropoietin, thrombopoietin, and prolactin), APM1 (including, but not limited to
adipose most abundant gene transcript 1), and the like.

• other extracellular signaling moieties, including, but not limited to, Sonic hedgehog, protein
hormones such as chorionic gonadotrophin and luteinizing hormone.

• blood clotting and coagulation factors including, but not limited to, TPA and Factor VIIa;
coagulation factor IX; coagulation factor X; Protein S; Fibrinogen and Thrombin;
Antithrombin III; streptokinase and urokinase, Retavase, and the like.

• transcription factors and other DNA binding proteins, including but not limited to, histones,
p53; myc; PIT1; NFkB; AP1; JUN; KD domain, homeodomain, heat shock transcription factors,
Stat, zinc finger proteins (e.g. zif268).

• Antibodies, antigens, and trojan horse antigens, including, but not limited to, immunoglobulin 
super family proteins, including but not limited to CD4 and CD8, Fc receptors, T-cell 
receptors, MHC-I, MHC-ll, CD3, and the like. Also, immunoglobulin-like proteins, including 
but not limited to fibronectin, pki domain, integrin domains, cadherin, invasins, cell surface
receptors with Ig-like domains, and the like. Intrabodies, and the like; Anti-Her2/neu antibody
(e.g. Herceptin); Anti-VEGF; Anti-CD20 (Rituxan), among others.

• intracellular signaling modules, including, but not limited to, kinases, phosphatases,
G-proteins, Phosphatidylinositol 3-kinase (PI3-kinase) kinase, Phosphatidylinositol 4-kinase, wnt
family members including but not limited to wnt-1 through wnt-15, EF hand proteins including
calmodulin, troponin C, S100B, calbindin and D9k; NOTCH; MEK; MAPK; ubiquitin and
ubiquitin like proteins, including UBL1, UBL5, UBL3 and UBL4, and the like.

• viral proteins, including, but not limited to, hemagglutinin trimerization domain and HIV Gp41
ectodomain (fusion domain); viral coat proteins, viral receptors, integrases, proteases,
reverse transcriptases.

• receptors, including, but not limited to, the extracellular region of human tissue factor,
cytokine-binding region of Gp130, G-CSF receptor, erythropoietin receptor, Fibroblast Growth
Factor receptor, TNF receptor, IL-1 receptor, IL-1 receptor/IL-1ra complex, IL-4 receptor, IFN-γ
receptor alpha chain, MHC Class I, MHC Class II, T Cell Receptor, Insulin receptor, insulin
receptor tyrosine kinase and human growth hormone receptor; Lectins; GPCRs, including but
not limited to G-Protein coupled receptors; ABC Transporters/Multidrug resistance proteins;
Na and K channels; Nuclear Hormone Receptors; Aquaporins; Transporters, RAGE (receptor
for advanced glycation end products), TRK-A, -B, -C, and the like, and haemopoietic receptors.

• enzymes including, but not limited to, hydrolases such as proteases/proteinases,
synthases/synthetases/ligases, decarboxylases/lyases, peroxidases, ATPases,
carbohydrases, lipases; isomerases such as racemases, epimerases, tautomerases, or
mutases; transferases, hydrolases, kinases, reductases/oxidoreductases, hydrogenases,
polymerases, phosphatases, and proteasomes and anti-proteasomes (e.g., MLN341). Suitable
enzymes include, but are not limited to, those listed in the Swiss-Prot enzyme database.

• additional proteins including but not limited to heat shock proteins, ribosomal proteins,
glycoproteins, motor proteins, transporters, drug resistance proteins, kinetoplasts and
chaperonins.

• antimicrobial peptides

• small proteins including but not limited to metal ligand and disulfide-bridged proteins such as
metallothionein, Kunitz-type inhibitors, crambin, snake and scorpion toxins, and trefoil
proteins; antimicrobial peptides such as defensins, thioredoxin, ferredoxin, transferrin and
the like.

• protein domains and motifs including, but not limited to, SH-2 domains, SH-3 domains,
Pleckstrin homology domains, WW domains, SAM domains, kinase domains, death domains,
RING finger domains, Kringle domains, heparin-binding domains, cysteine-rich domains,
leucine zipper domains, zinc finger domains, nucleotide binding motifs, transmembrane
helices, and helix-turn-helix motifs. Additionally, ATP/GTP-binding site motif A; Ankyrin
repeats; fibronectin domain; Frizzled (fz) domain; GTPase binding domain; C-type lectin
domain; PDZ domain; 'Homeobox' domain; Krueppel-associated box (KRAB); Leucine zipper;
DEAD and DEAH box families; ATP-dependent helicases; HMG1/2 signature; DNA mismatch
repair proteins mutL / hexB / PMS1 signature; Thioredoxin family active site; Thioredoxins;
Annexins repeated domain signature; Clathrin light chains signatures; Myotoxins signature;
Staphylococcal enterotoxins / Streptococcal pyrogenic exotoxins signatures; Serpins
signature; Cysteine proteases inhibitors signature; Chaperonins; Heat shock; WD domains;
EGF-like domains; Immunoglobulin domains, Immunoglobulin-like proteins and the like.

• specific protein sites or other subsets of residues, including but not limited to protease
cleavage/recognition sites, phosphorylation sites, metal binding sites, and signal sequences.
Additionally, proteins having post-translational modifications include, but are not limited to:
N-glycosylation site; O-glycosylation site; Glycosaminoglycan attachment site; Tyrosine sulfation
site; cAMP- and cGMP-dependent protein kinase phosphorylation site; Protein kinase C
phosphorylation site; Casein kinase II phosphorylation site; Tyrosine kinase phosphorylation
site; N-myristoylation site; Amidation site; Aspartic acid and asparagine hydroxylation site;
Vitamin K-dependent carboxylation domain; Phosphopantetheine attachment site; Prokaryotic
membrane lipoprotein lipid attachment site; Prokaryotic N-terminal methylation site; Prenyl
group binding site (CAAX box); Intein N- and C-terminal splicing motif profiles, and the like.

• proteins involved in motility, including but not limited to chemokines, S100 family proteins
(including but not limited to NRAGE).

• Peptides - defensins

• peptide ligands including, but not limited to, a short region from the HIV-1 envelope
cytoplasmic domain (shown to block the action of cellular calmodulin), regions of the Fas
cytoplasmic domain (death-inducing apoptotic or G protein inducing functions), magainin, a
natural peptide derived from Xenopus (anti-tumor and anti-microbial activity), short peptide
fragments of a protein kinase C isozyme, βPKC (blocks nuclear translocation of full-length
βPKC in Xenopus oocytes following stimulation), SH-3 target peptides, natriuretic peptides
(ANP, BNP, and CNP), and fibrinopeptides and neuropeptides.

• presentation scaffolds or "ministructures" including, but not limited to, minibody structures
(see for example Bianchi et al., J. Mol. Biol. 236(2):649-59 (1994), and references cited
therein, all of which are incorporated by reference), maquettes (Grosset et al., Biochemistry
40:5474-5487 (2001)), loops on beta-sheet turns and coiled-coil stem structures (see, for
example, Myszka et al., Biochem. 33:2362-2373 (1994) and Martin et al., EMBO J.
13(22):5303-5309 (1994), incorporated by reference), zinc-finger domains, transglutaminase
linked structures, cyclic peptides, B-loop structures, coiled coils, helical bundles, helical
hairpins, and beta hairpins.

• ion channel protein domains, including but not limited to sodium, calcium, potassium, and
chloride, including their component subunits. Examples of extracellular ligand-gated ion
channels include nAChR receptors, GABA and glycine, 5-HT, MOD-1, P(2X), glutamate
NMDA, AMPA, Kainate receptors, GluR-B, ORCC, P2X3, inward rectifying channels, ROMK,
IRK, BIR, and the like. Also included are voltage-gated ion channels, intracellular
ligand-gated ion channels, mechanosensitive and cell volume-regulated ion channels, and
the like.

In addition, a preferred embodiment utilizes scaffold proteins such as random peptides. That is, there
is a significant amount of work being done in the area of utilizing random peptides in high throughput
screening techniques to identify biologically relevant (particularly disease states) proteins. The
methods of the invention are particularly relevant for computationally prescreening random peptide
libraries to drastically reduce the amount of wet chemistry that must be done, by removing sequences
that are unlikely to be successful. Different design criteria can be used to produce candidate sets that
are biased for properties such as charge, solubility, or active site characteristics (polarity, size), or are
biased to have certain amino acids at certain positions or to take on certain folds. That is, the
peptides (which may be the scaffold protein or the candidate agents, as outlined below) are
randomized, either fully randomized or they are biased in their randomization, e.g. in
nucleotide/residue frequency generally or per position. By "randomized" or grammatical equivalents
herein is meant that each nucleic acid and peptide consists of essentially random nucleotides and
amino acids, respectively. Thus, any amino acid residue may be incorporated at any position. The
synthetic process can be designed to generate randomized peptides and/or nucleic acids, to allow the
formation of all or most of the possible combinations over the length of the nucleic acid, thus forming
a library of randomized candidate nucleic acids.

In one embodiment, the library is fully randomized, with no sequence preferences or constants at any
position. In a preferred embodiment, the library is biased. That is, some positions within the
sequence are either held constant, or are selected from a limited number of possibilities. For
example, in a preferred embodiment, the nucleotides or amino acid residues are randomized within a
defined class, for example, of hydrophobic amino acids, hydrophilic residues, sterically biased (either
small or large) residues, towards the creation of cysteines for cross-linking, prolines for SH-3
domains, serines, threonines, tyrosines or histidines for phosphorylation sites, etc., or to purines, etc.
In a preferred embodiment, the bias is towards peptides or nucleic acids that interact with known
classes of molecules. For example, it is known that much of intracellular signaling is carried out via
short regions of polypeptides interacting with other polypeptides through small peptide domains. In
addition, agonists and antagonists of any number of molecules may be used as the basis of biased
randomization of candidate bioactive agents as well.
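
A biased library of the kind just described may be sketched, purely for illustration, by giving each
position either a single constant residue or a class of allowed residues from which one is drawn at
random. The residue classes, the template, and the function name `biased_peptide` are hypothetical
choices made for this sketch.

```python
import random

# Assumed residue classes for illustration; other groupings are equally possible.
HYDROPHOBIC = "AVLIMFW"
HYDROPHILIC = "STNQDEKRH"
ALL = "ACDEFGHIKLMNPQRSTVWY"

def biased_peptide(template, rng):
    """template holds, per position, the string of residues allowed at that position."""
    return "".join(rng.choice(allowed) for allowed in template)

rng = random.Random(0)
# Position 1 held constant (C), positions 2-3 hydrophobic, 4 hydrophilic, 5 fully random.
template = ["C", HYDROPHOBIC, HYDROPHOBIC, HYDROPHILIC, ALL]
library = [biased_peptide(template, rng) for _ in range(10)]
```

A fully randomized library corresponds to a template in which every position allows all twenty
amino acids.
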

In general, the generation of prescreened random peptide libraries may be described as follows.
Any structure, whether a known structure, for example a portion of a known protein, a known peptide,
etc., or a synthetic structure, can be used as the backbone for computational screening. For
example, structures from X-ray crystallographic techniques, NMR techniques, de novo modelling,
homology modelling, etc. may all be used to pick a backbone for which sequences are desired.
Similarly, a number of molecules or protein domains are suitable as starting points for the generation
of biased randomized candidate bioactive agents. A large number of small molecule domains are
known that confer a common function, structure or affinity. In addition, as is appreciated in the art,
areas of weak amino acid homology may have strong structural homology. A number of these
molecules, domains, and/or corresponding consensus sequences are known, including, but not
limited to, SH-2 domains, SH-3 domains, Pleckstrin, death domains, protease cleavage/recognition
sites, enzyme inhibitors, enzyme substrates, Traf, etc. Similarly, there are a number of known nucleic
acid binding proteins containing domains suitable for use in the invention. For example, leucine
zipper consensus sequences are known. Thus, in general, known peptide ligands can be used as the
starting scaffold backbone for the generation of the primary library.

In a preferred embodiment, the scaffold protein is a variant protein, including, but not limited to,
mutant proteins comprising one or a plurality of substitutions, insertions or deletions, including 
chimeric genes, and genes that have been optimized in any number of ways, including experimentally 
or computationally. 

In a preferred embodiment, the scaffold protein is a chimeric protein. A chimeric protein (sometimes
referred to as a "fusion protein") in this context means a protein that has sequences from at least two
different sequences operably linked or fused. The chimeric protein may be made using either a single
linkage point or a plurality of linkage points. In addition, the source of the parent protein sequences
may be as listed above for scaffold proteins, e.g. prokaryotes, eukaryotes, including archaebacteria
and viruses, etc.

As will be appreciated by those in the art, chimeric proteins may be made from different naturally
occurring proteins in a gene family (e.g. one with recognizable sequence or structural homology) or by
artificially joining two or more distinct genes. For example, the binding domain of a human protein
may be fused with the activation domain of a mouse gene, etc.

The sequence of the chimeric gene may be constructed synthetically (e.g. arbitrary or targeted
portions of two or more genes are crossed over randomly or purposely), experimentally (e.g. through
homologous recombination or shuffling techniques) or computationally (e.g. using genetic annealing
programs, "in silico shuffling", alignment programs, etc.). For the purposes of the invention, these
techniques can be done at the protein or nucleic acid level.

In a preferred embodiment, the scaffold protein is actually a product of a computational design cycle
and/or screening process. That is, a first round of the methods of the invention may produce one or
more sequences for which further analysis is desired.

Although several classes of proteins have been stated herein, this should not be construed as an
exhaustive list, but rather some examples of proteins that may be optimized using the computational
methodologies outlined herein, including PDA™ technology.

Preparation of Protein Backbone for Calculations

The protein scaffold may be modified or altered at the beginning (and optionally, but not preferably, in
the middle or end) of a protein design calculation, or the unaltered scaffold may be used. It is also
possible to use methods in which the protein scaffold is modified during later steps of a design
calculation, including during the energy calculation and optimization steps.

In a preferred embodiment, the protein scaffold backbone (comprising the nitrogen, the carbonyl carbon,
the α-carbon, and the carbonyl oxygen, along with the direction of the vector from the α-carbon to the
β-carbon) may be altered prior to the computational analysis, for example by varying a set of
parameters called supersecondary structure parameters. See for example U.S. Patent Nos.
6,269,312, 6,188,965, and 6,403,312, all of which are herein expressly incorporated by reference.
Alternatively, the protein scaffold is altered using other methods, such as manually, including directed
or random perturbations.

Most protein structures contain loop regions that are flexible or conformationally heterogeneous. The
protein backbone may be modified in the loop regions using methods such as molecular dynamics
simulations and analysis of databases of known loop structures. In addition, loops may be modified in
order to incorporate new structural or functional properties such as new binding sites.

In a preferred embodiment, the design cycle is done using a plurality or set of scaffold proteins. That
is, the scaffold may be a set of protein structures created by perturbing the starting structure. This
may be done using any number of techniques, including molecular dynamics and Monte Carlo
analysis, that alter the protein structure (including changing the backbone and side chain torsion
angles). Alternatively, an ensemble of structures such as those obtained from NMR may be used as
the scaffold. These backbone modifications are particularly useful for enhancing the diversity of
sequences derived from protein design simulations. Similarly, other useful ensembles include sets of
related proteins, sets of related structures, artificially created ensembles, etc.

In a preferred embodiment, once a protein structure backbone is generated (with alterations, as
outlined above), explicit hydrogens are added if not included within the structure. For example, if the
structure was determined using X-ray crystallography, hydrogens are typically added.

In a preferred embodiment, energy minimization of the structure is run to relax strain, including strain
due to van der Waals clashes, unfavorable bond angles, and unfavorable bond lengths. In an
especially preferred embodiment, this is done by doing a number of steps of conjugate gradient
minimization (see Mayo et al., J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to
minimize the Dreiding force field with no electrostatics. Generally from 10 to 250 steps is preferred,
with 50 steps being most preferred.

Identification of Variable, Floated, and Fixed Positions

In a preferred embodiment, all of the residue positions of the protein are variable. This is particularly
desirable for smaller proteins, although the present methods allow the design of larger proteins as
well. In an alternate preferred embodiment, only some of the residue positions of the protein are
variable, and the remainder are fixed or floated. In this embodiment, the variable residues may be at
least one, or anywhere from 0.001% to 99.999% of the total number of residues. Thus, for example, it
may be possible to change only a few (or one) residues, or most of the residues, with all possibilities
in between.

In an alternate embodiment, only one or two residue positions are variable and the residue positions
within a small distance of, for example, 4 Å to 6 Å of the variable residue positions are floated. In this
embodiment, it is possible to conduct separate calculations for different positions and then combine
the results to yield protein variants with multiple mutations. The optimization procedure may be
iterative: the results from one calculation are used as the starting point for the next calculation,
one residue position at a time. Iteration may be performed until a consistent result is reached.
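
The iterative procedure may be sketched as follows, as an illustration only. One position is
optimized at a time while the others are held at their current residues, and passes are repeated
until a full pass changes nothing. The energy function here is a toy stand-in and the floating of
neighboring residues is omitted; `iterate_positions`, `toy_energy`, and the example residues are
hypothetical.

```python
def iterate_positions(start, choices, energy, max_rounds=20):
    """Optimize one position at a time, holding the others at their current residue."""
    current = list(start)
    for _ in range(max_rounds):
        previous = list(current)
        for pos, allowed in enumerate(choices):
            current[pos] = min(
                allowed,
                key=lambda aa: energy(current[:pos] + [aa] + current[pos + 1:]),
            )
        if current == previous:  # consistent result: a full pass changed nothing
            break
    return "".join(current)

# Toy energy (an assumption): mismatches against a hidden optimum plus one coupling term.
OPTIMUM = "LKE"

def toy_energy(seq):
    mismatch = sum(a != b for a, b in zip(seq, OPTIMUM))
    coupling = -0.5 if seq[0] == "L" and seq[2] == "E" else 0.0
    return mismatch + coupling

result = iterate_positions("AAA", ["ALV", "KRA", "EDA"], toy_energy)
```

Because each single-position calculation starts from the outcome of the previous one, couplings
between positions are taken into account gradually rather than all at once.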

In a preferred embodiment, residues which may be fixed include, but are not limited to, structurally or
biologically functional residues. For example, residues which are known to be important for biological 
activity, such as the residues which form the active site of an enzyme, the substrate binding site of an 
enzyme, the binding site for a binding partner (ligand/receptor, antigen/antibody, etc.), 
phosphorylation or glycosylation sites, or structurally important residues, such as cysteines 
participating in disulfide bridges, metal binding sites, critical hydrogen bonding residues, residues 
critical for backbone conformation such as proline or glycine, residues critical for packing interactions, 
etc. may all be fixed or floated. 

Similarly, residues which may be chosen as variable residues may be those that confer undesirable
biological attributes, such as susceptibility to proteolytic degradation, unwanted oligomerization or
aggregation, glycosylation sites which may lead to unwanted immune responses, unwanted binding
activity, unwanted allostery, undesirable enzyme activity, etc.

Alternatively, residues that confer desired protein properties may be specifically targeted for variation.
In a preferred embodiment, this design strategy may be used to alter properties such as binding
affinity and specificity and catalytic efficiency and mechanism. A region such as a binding site or
active site may be defined, for example, to include all residues within a certain distance, for example
4-10 Å, or preferably 6 Å, of the residues that are in van der Waals contact with the substrate or
ligand. Alternatively, a region such as a binding site or active site may be defined using experimental
results; for example, a binding site could include all positions at which mutation has been shown to
affect binding.
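
The distance-based definition of a binding site may be sketched as follows. The sketch assumes
that atomic coordinates are available as a mapping from residue name to a list of (x, y, z) tuples in
Å and that the residues in van der Waals contact with the ligand have already been identified. The
residue labels, coordinates, and function name `within_cutoff` are hypothetical.

```python
import math

def within_cutoff(residue_coords, contact_residues, cutoff=6.0):
    """Return residues having any atom within `cutoff` Å of any atom of a contact residue."""
    site = set(contact_residues)
    for name, atoms in residue_coords.items():
        for contact in contact_residues:
            if any(math.dist(a, b) <= cutoff
                   for a in atoms for b in residue_coords[contact]):
                site.add(name)
                break
    return sorted(site)

# Hypothetical one-atom residues; A10 is assumed to contact the ligand.
coords = {
    "A10": [(0.0, 0.0, 0.0)],
    "A11": [(3.0, 4.0, 0.0)],   # 5 Å from A10
    "A50": [(20.0, 0.0, 0.0)],  # 20 Å from A10
}
site = within_cutoff(coords, ["A10"])
```

With the preferred 6 Å cutoff the site contains A10 and A11; a smaller cutoff such as 4 Å would
leave only the contact residue itself.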

Select Amino Acids to be Considered at Each Position

A set of amino acid side chains is assigned to each variable position. That is, the set of possible
amino acid side chains that will be considered at each particular position is chosen. In one
embodiment, variable positions are not classified and all amino acids are considered at each variable
position. Alternatively, a subset of amino acids is considered at each variable position. Methods for
determining subsets of amino acids include, but are not limited to, those discussed below. Any
combination of classification methods, including no classification, may be applied to the different
variable positions.

In a preferred embodiment, all amino acid residues are allowed at each variable residue position
identified in the primary library. That is, once the variable residue positions are identified, a
secondary library comprising every combination of every amino acid at each variable residue position
is made.
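
As an illustrative sketch, such a library may be enumerated by substituting every combination of
the twenty amino acids at the variable positions of a scaffold sequence; for n variable positions
the library contains 20^n members. The scaffold sequence, the zero-based position indices, and the
function name `full_library` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def full_library(scaffold, variable_positions):
    """Yield every sequence with all 20 amino acids at each variable position (0-based)."""
    for combo in product(AMINO_ACIDS, repeat=len(variable_positions)):
        seq = list(scaffold)
        for pos, aa in zip(variable_positions, combo):
            seq[pos] = aa
        yield "".join(seq)

# Hypothetical five-residue scaffold with two variable positions: 20**2 sequences.
library = list(full_library("MKTAY", [1, 3]))
```

Because the size grows as a power of the number of variable positions, this approach is practical
only when few positions are varied.
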

In a preferred embodiment, subsets of amino acids are chosen to maximize coverage. Additional
amino acids with properties similar to those contained within the primary library may be manually
added. For example, if the primary library includes three large hydrophobic residues at a given
position, the user may choose to include additional large hydrophobic residues at that position when
generating the secondary library. In addition, amino acids in the primary library that do not share
similar properties with most of the amino acids at a given position may be excluded from the
secondary library. Alternatively, subsets of amino acids may be chosen from the primary library such
that a maximal diversity of side chain properties is sampled at each position. For example, if the
primary library includes three large hydrophobic residues at a given position, the user may choose to
include only one of them in the secondary library, in combination with other amino acids that are not
large and hydrophobic.

In a preferred embodiment, each variable position is classified as either a core, surface or boundary
residue position. The classification of residue positions as core, surface or boundary may be done in
several ways, as will be appreciated by those in the art. In a preferred embodiment, the classification
is done via a visual scan of the original protein scaffold and assigning a classification based on a
subjective evaluation of one skilled in the art of protein modeling. Alternatively, a preferred
embodiment, called RESCLASS, utilizes an assessment of the orientation of the Cα-Cβ vectors
relative to a solvent accessible surface computed using only the template Cα atoms, as outlined in
U.S. Patent Nos. 6,269,312, 6,188,965, and 6,403,312, expressly incorporated herein by
reference. Alternatively, a surface area calculation may be done. In an alternate embodiment, the
results of the RESCLASS calculation are used in conjunction with the results of a surface area
calculation in order to classify residue positions.

A core residue will generally be selected from a set of hydrophobic residues consisting of alanine,
valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some
embodiments, methionine may be removed from the set). Similarly, surface positions are generally
selected from a set of hydrophilic residues consisting of alanine, serine, threonine, aspartic acid,
asparagine, glutamine, glutamic acid, arginine, lysine and histidine. Finally, boundary positions are
generally chosen from alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid,
arginine, lysine, histidine, valine, isoleucine, leucine, phenylalanine, tyrosine, tryptophan, and
methionine.

In a preferred embodiment, proline, cysteine and glycine are not included in the list of possible amino
acid side chains, and thus the rotamers for these side chains are not used. However, in an alternate
preferred embodiment, when the variable residue position has a φ angle (that is, the dihedral angle
defined by 1) the carbonyl carbon of the preceding amino acid; 2) the nitrogen atom of the current
residue; 3) the α-carbon of the current residue; and 4) the carbonyl carbon of the current residue)
greater than 0 degrees, the position is set to glycine to minimize backbone strain. In an alternate
embodiment, cysteine is considered at positions where disulfide bonds are desired. In another
alternate embodiment, proline is considered at positions whose backbone conformation is allowable
for proline.
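
By way of non-limiting illustration, the φ angle rule described above may be sketched as follows. The dihedral is computed from the four atoms named in the text using a standard vector formula; the function names and the three-letter residue code used for glycine are assumptions made only for this sketch.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond.

    For phi: p0 = carbonyl C of the preceding residue, p1 = N, p2 = C-alpha,
    p3 = carbonyl C of the current residue.
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def allowed_residues(phi_degrees, default_set):
    """Set the position to glycine when phi is greater than 0 degrees."""
    if phi_degrees > 0.0:
        return {"GLY"}
    return set(default_set)
```

In use, the result of `dihedral_degrees` for a variable position would be passed to `allowed_residues` together with the residue set otherwise assigned to that position.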

As will be appreciated by those in the art, there is a computational benefit to classifying the residue
positions, as it decreases the combinatorial complexity of the problem. It should also be noted that
there may be situations where alternative classification approaches will be applied or where the sets
of core, boundary and surface residues are altered from those described above; for example, under
some circumstances, one or more amino acids is either added or subtracted from the set of allowed
amino acids. For example, hydrophobic residues may be included at solvent exposed positions in
order to confer desired oligomerization or ligand binding activity, and polar residues may be included
in the core of the protein in order to construct an active site or a binding site. Similarly, in one
embodiment, only residues capable of forming N-capping interactions are included at the position
immediately preceding each helix, and amino acids that interact unfavorably with the helix dipole are
subtracted from the set of polar residues at the three positions at the beginning and end of each helix.
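
By way of non-limiting illustration, the reduction in combinatorial complexity obtained by classifying positions may be sketched as follows, using the core, surface and boundary residue sets generally described above in one-letter code. The names `RESIDUE_SETS` and `library_size` are assumptions made only for this sketch.

```python
from math import prod

# One-letter codes for the residue sets generally used at each class of position.
CORE = set("AVILFYWM")         # Ala, Val, Ile, Leu, Phe, Tyr, Trp, Met
SURFACE = set("ASTDNQERKH")    # Ala, Ser, Thr, Asp, Asn, Gln, Glu, Arg, Lys, His
BOUNDARY = CORE | SURFACE      # the 17 residues listed for boundary positions
ALL_20 = set("ACDEFGHIKLMNPQRSTVWY")

RESIDUE_SETS = {"core": CORE, "surface": SURFACE, "boundary": BOUNDARY}

def library_size(position_classes, classified=True):
    """Number of sequences when each position draws from its class set (or all 20)."""
    return prod(
        len(RESIDUE_SETS[c]) if classified else len(ALL_20)
        for c in position_classes
    )
```

For example, for two core, one surface and one boundary position, `library_size` gives 8 x 8 x 10 x 17 = 10,880 sequences with classification, compared with 20^4 = 160,000 without it.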

In a preferred embodiment, the set of amino acids allowed at each position is determined using
sequence or structure alignment methods. For example, the set of amino acids allowed at each
position may comprise the set of amino acids that is observed at that position in the alignment, or the
set of amino acids that is observed most frequently in the alignment.

In another preferred embodiment, the set of amino acids allowed at each position comprises the set of
amino acids that are known to interact with a particular class of molecules or to serve a specific
function. Possible sets include, but are not limited to, residues that may ligate or coordinate to certain
metals (such as zinc, copper, iron, and molybdenum), residues that may undergo posttranslational
modification (such as phosphorylation, glycosylation, prenylation, and lipidation), and residues that
are amenable to synthetic modification. Synthetic modifications include, but are not limited to,
alkylation or acylation, which includes but is not limited to PEGylation, biotinylation, fluorophore
conjugation, acetylation, oxidative or reductive homo- or heterooligomerization, native ligation,
conjugation to synthetic mono- and oligosaccharides, and covalent or non-covalent attachment to a
solid support (e.g. glass beads, glass slides, or 96-well plates). Sites of synthetic modifications
include, but are not limited to, the amide N-H, the amino acid side chains, the amino or carboxyl
terminus of the protein, or any of the various posttranslational modifications.

In a preferred embodiment, the set of allowed amino acids includes one or more non-natural or
non-canonical amino acids. Synthetic modifications of the non-natural or non-canonical amino acids
are also viable. In addition to the modifications listed above, these synthetic transformations include,
but are not limited to, intra- and intermolecular metal mediated couplings such as the Heck reaction or
Suzuki coupling and conjugation through Schiff base formation. In a preferred embodiment, the set of
allowed amino acids includes more than one charge state for some or all of the acidic or basic
residues (that is, arginine, lysine, histidine, glutamic acid, aspartic acid, cysteine, and tyrosine).

Select the Set of Rotamers That Will Be Used to Model Each Residue Type

In a preferred embodiment, a set of discrete side chain conformations, called rotamers, are
considered for each amino acid. Thus, a set of rotamers will be considered at each variable and
floated position. Rotamers may be obtained from published rotamer libraries (see Lovell et al.,
Proteins: Structure Function and Genetics 40:389-408 (2000); Dunbrack and Cohen, Protein Science
6:1661-1681 (1997); De Maeyer et al., Folding and Design 2:53-66 (1997); Tuffery et al., J. Biomol.
Struct. Dyn. 8:1267-1289 (1991); Ponder and Richards, J. Mol. Biol. 193:775-791 (1987)), from
molecular mechanics or ab initio calculations, and using other methods. In a preferred embodiment, a
flexible rotamer model is used (see Mendes et al., Proteins: Structure, Function, and Genetics
37:530-543 (1999)). Similarly, artificially generated rotamers may be used, or may augment the set
chosen for each amino acid and/or variable position. In a preferred embodiment, at least one
conformation that is not low in energy is included in the list of rotamers. In an alternative embodiment,
the identity of each amino acid, rather than specific conformational states of each amino acid, is used;
i.e., use of rotamers is not essential.

Generating ranks or lists of possible sequences 

In essence, any computational methods that may result in either the relative ranking of the possible 
sequences of a protein or a list of suitable sequences may be used to generate a primary library. As 
will be appreciated by those in the art, any of the methods described herein or known in the art may 
be used. Each method may be used alone, or in combination with other methods. In a preferred
embodiment, knowledge-based and statistical methods are used. Alternatively, methods that rely on 
energy calculations may also be used. Protein design methods use various criteria to screen 
sequences, resulting in sequences that are likely to possess desired properties. The design criteria 
may be altered to generate primary libraries that are likely to contain proteins possessing a different 
set of desired properties. 

Knowledge-based and Statistical Methods 

In a preferred embodiment, sequence and/or structural alignment programs may be used to generate
primary libraries. For example, various alignment methods may be used to create sequence
alignments of proteins related to the target structure (see for example Altschul et al., J. Mol. Biol.
215(3): 403 (1990), incorporated by reference). Sequences may be related at the level of primary,
secondary, or tertiary structure. Alternatively, sequences may be related by function or activity.
These sequence alignments are then examined to determine the observed sequence variations.
These sequence variations are tabulated to define a primary library, or used to bias the convergence
of a protein design algorithm.

As is known in the art, sequence alignments can be analyzed using statistical methods to calculate
the sequence diversity at any position in the alignment, and the occurrence frequency or probability of
each amino acid at a position. In the simplest embodiment, these occurrence frequencies are
calculated by counting the number of times an amino acid is observed at an alignment position, then
dividing by the total number of sequences in the alignment. In other embodiments, the contribution of
each sequence, position or amino acid to the counting procedure is weighted by a variety of possible
mechanisms. For example, sequences may be weighted towards or away from a wild type sequence,
towards a human sequence, etc.
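
By way of non-limiting illustration, the counting procedure described above may be sketched as follows. The alignment is assumed, for this sketch only, to be a list of equal-length strings; the function name and the optional per-sequence `weights` argument are likewise assumptions. With no weights, each frequency is the count divided by the number of sequences.

```python
from collections import Counter

def position_frequencies(alignment, weights=None):
    """Per-position amino acid frequencies for a list of equal-length aligned sequences.

    With weights=None every sequence counts once, so each frequency is
    (times observed) / (number of sequences). Otherwise each sequence
    contributes its weight and the total weight is used as the denominator.
    """
    if weights is None:
        weights = [1.0] * len(alignment)
    total = sum(weights)
    frequencies = []
    for col in range(len(alignment[0])):
        counts = Counter()
        for seq, w in zip(alignment, weights):
            counts[seq[col]] += w
        frequencies.append({aa: c / total for aa, c in counts.items()})
    return frequencies
```

Gap characters, if present, are counted like any other symbol in this sketch; a practical implementation might treat them separately.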

Furthermore, the sequence alignments may be analyzed to produce the probability of observing two 
residues simultaneously at two positions. These probabilities may serve as a measure of the strength 
of coupling between residues. In one embodiment, the probabilities may then be used to favor 
selection of sequences that maintain conserved residue pairs and disfavor selection of sequences 
that contain pairs that are seldom or never observed in sequence homologs.
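
By way of non-limiting illustration, the probability of observing two residues simultaneously at two positions may be sketched as follows. The comparison of the joint probability with the product of the single-position probabilities (`coupling_ratio`) is one possible coupling measure chosen for this sketch; it is an assumption and not a measure specified above.

```python
from collections import Counter

def pair_probabilities(alignment, i, j):
    """Probability of each (residue at i, residue at j) pair in the alignment."""
    counts = Counter((seq[i], seq[j]) for seq in alignment)
    return {pair: n / len(alignment) for pair, n in counts.items()}

def coupling_ratio(alignment, i, j, a, b):
    """Joint probability of (a, b) divided by the product of the marginals.

    Values above 1 suggest the pair co-occurs more often than expected if the
    two positions varied independently; 0 means the pair is never observed.
    """
    n = len(alignment)
    p_a = sum(seq[i] == a for seq in alignment) / n
    p_b = sum(seq[j] == b for seq in alignment) / n
    p_ab = pair_probabilities(alignment, i, j).get((a, b), 0.0)
    return p_ab / (p_a * p_b) if p_a and p_b else 0.0
```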

As is known in the art, there are a number of sequence-based alignment programs, including for
example, Smith-Waterman searches, Needleman-Wunsch, Double Affine Smith-Waterman, frame
search, Gribskov/GCG profile search, Gribskov/GCG profile scan, profile frame search, Bucher
generalized profiles, Hidden Markov models, Hframe, Double Frame, Blast, Psi-Blast, Clustal,
GeneWise, and FASTA.

The source of the sequences may vary widely, and include taking sequences from one or more of the
known databases, including, but not limited to, SCOP (Hubbard, et al., Nucleic Acids Res 27(1):
254-256 (1999)); PFAM (Bateman, et al., Nucleic Acids Res 27(1): 260-262 (1999);
http://www.sanger.ac.uk/Pfam/); TIGRFAM (http://www.tigr.org/TIGRFAMs); VAST (Gibrat, et al., Curr
Opin Struct Biol 6(3): 377-385 (1996)); CATH (Orengo, et al., Structure 5(8): 1093-1108 (1997));
PhD Predictor (http://www.embl-heidelberg.de/predictprotein/predictprotein.html); Prosite (Hofmann,
et al., Nucleic Acids Res 27(1): 215-219 (1999); http://www.expasy.ch/prosite/); SwissProt
(http://www.expasy.ch/sprot/); PIR (http://www.mips.biochem.mpg.de/proj/protseqdb/); GenBank
(http://www.ncbi.nlm.nih.gov/Genbank/); Entrez (http://www.ncbi.nlm.nih.gov/entrez/); RefSeq
(http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html); EMBL Nucleotide Sequence Database
(http://www.ebi.ac.uk/embl/); DDBJ (http://www.ddbj.nig.ac.jp/); PDB (www.rcsb.org) and BIND
(Bader, et al., Nucleic Acids Res 29(1): 242-245 (2001); http://www.bind.ca/). In addition, sequences
may be obtained from genome and SNP databases of organisms including, but not limited to, human,
mouse, worm, fly, plants, fungi, bacteria, and viruses. These may include public databases, for
example The Genome Database of The Human Genome Project (http://gdbwww.gdb.org/), or private
databases, for example those of Celera Genomics Corporation (http://www.celera.com/) or Incyte
Genomics (http://www.incyte.com/).

In a preferred embodiment, the contribution of each aligned sequence to the frequency statistics is
weighted according to its diversity weighting relative to other sequences in the alignment. A common
strategy for accomplishing this is the sequence weighting system recommended by Henikoff and
Henikoff (see Henikoff S, Henikoff JG. Amino acid substitution matrices. Adv Protein Chem. 2000;
54:73-97. Review. PMID: 10829225 and Henikoff S, Henikoff JG. Position-based sequence weights. J
Mol Biol. 1994 Nov 4; 243(4): 574-8. PMID: 7966282), each of which is herein expressly incorporated
by reference.
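
By way of non-limiting illustration, position-based sequence weights of the kind described by Henikoff and Henikoff may be sketched as follows. At each alignment position, a sequence receives 1/(k x n), where k is the number of different residue types at that position and n is the number of sequences sharing that sequence's residue; the contributions are summed over positions. The function name, the input layout (a list of equal-length strings) and the final normalization to a sum of 1 are assumptions made only for this sketch.

```python
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1."""
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)  # number of distinct residue types in this column
        for idx, aa in enumerate(column):
            # Residues shared by many sequences contribute less per sequence.
            weights[idx] += 1.0 / (k * counts[aa])
    total = sum(weights)
    return [w / total for w in weights]
```

Sequences that carry unusual residues thus receive higher weights than sequences that are nearly identical to many others in the alignment.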

In a preferred embodiment, only sequences within a preset level of homology to the template
sequence are included in the alignment (> 60% identity, > 70% identity, etc.).

In a preferred embodiment, the contribution of each sequence to the statistics is dependent on its
extent of similarity to the target sequence, such that sequences with higher similarity to the target
sequence are weighted more highly. Examples of similarity measures include, but are not limited to,
sequence identity, BLOSUM similarity score, PAM matrix similarity score, and Blast score.
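
By way of non-limiting illustration, an identity cutoff combined with similarity-dependent weighting may be sketched as follows. Sequence identity is used as the similarity measure; the definition of percent identity (matching non-gap positions divided by the target length, for pre-aligned sequences of equal length), the function names, and the choice of weights directly proportional to identity are assumptions made only for this sketch.

```python
def percent_identity(seq, target):
    """Percent of target positions at which the aligned sequence has the same residue."""
    matches = sum(1 for a, b in zip(seq, target) if a == b and a != "-")
    return 100.0 * matches / len(target)

def filter_and_weight(alignment, target, min_identity=60.0):
    """Keep sequences above the identity cutoff; weight them in proportion to identity."""
    scored = [(seq, percent_identity(seq, target)) for seq in alignment]
    kept = [(seq, pid) for seq, pid in scored if pid > min_identity]
    total = sum(pid for _, pid in kept)
    # Weights sum to 1; more similar sequences contribute more heavily.
    return [(seq, pid / total) for seq, pid in kept]
```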

In a preferred embodiment, the contribution of each sequence to the statistics is dependent on its
known physical or functional properties. These properties include, but are not limited to, thermal and
chemical stability, contribution to activity, solubility, etc. For example, when optimizing the target
sequence for solubility, those sequences in an alignment with high solubility levels will contribute
more heavily to the calculated frequencies.

In a preferred embodiment, each of the weighted or unweighted alignment frequencies is converted
directly to a pseudo-energy as -log(fa). Thus, amino acids with higher frequency are assigned lower
(more favorable) pseudo energies. If a frequency is zero, a constant positive pseudo energy may be
applied.
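
By way of non-limiting illustration, the conversion of an alignment frequency to a pseudo-energy may be sketched as follows. The function name, the use of the natural logarithm, and the value of the constant applied to zero frequencies are assumptions made only for this sketch.

```python
import math

def pseudo_energy(fa, zero_penalty=10.0):
    """Pseudo-energy -log(fa); a constant positive value is used when fa is zero."""
    if fa <= 0.0:
        return zero_penalty  # hypothetical constant; the text leaves its value open
    return -math.log(fa)
```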

In a preferred embodiment, each of the final alignment frequencies (fa) is divided by the observed
frequency (fo) of occurrence of each amino acid in all proteins. The log of this ratio, known to those in
the art as the log-odds ratio, log(fa/fo), reflects the extent of natural selection for/against each amino
acid at each position in the protein. Positive numbers reflect positive selection while negative
numbers reflect negative selection. These log-odds ratios may then be used as pseudo energy terms
within a PDA™ technology simulation. In situations where lower energies are favorable, the negative
log-odds, -log(fa/fo), is a more appropriate pseudo energy term. If a frequency is zero, a constant
positive energy may be applied.
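
By way of non-limiting illustration, negative log-odds pseudo energies for one alignment position may be sketched as follows. The function name, the dictionary inputs, the natural logarithm and the constant used for zero frequencies are assumptions made only for this sketch; background frequencies (fo) would in practice be taken from a large set of proteins.

```python
import math

def log_odds_energies(position_freqs, background, zero_penalty=10.0):
    """Map each amino acid to -log(fa/fo) at one alignment position.

    position_freqs: amino acid -> alignment frequency fa at this position.
    background: amino acid -> frequency fo of occurrence in all proteins (all > 0).
    """
    energies = {}
    for aa, fo in background.items():
        fa = position_freqs.get(aa, 0.0)
        # Unobserved amino acids receive a constant positive energy.
        energies[aa] = zero_penalty if fa <= 0.0 else -math.log(fa / fo)
    return energies
```

Amino acids enriched at the position relative to background (fa > fo) receive negative, that is favorable, energies under this convention.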

In a preferred embodiment, no pseudo energies are created. Rather, the position-specific alignment
information is used directly to generate the list of possible amino acids at a variable residue position in
a PDA™ technology simulation. See Lehmann M, Wyss M. Engineering proteins for thermostability:
the use of sequence alignments versus rational design and directed evolution. Curr Opin Biotechnol.
2001 Aug; 12(4): 371-5. Review; Lehmann M, Pasamontes L, Lassen SF, Wyss M. The consensus
concept for thermostability engineering of proteins. Biochim Biophys Acta. 2000 Dec 29; 1543(2):
408-415. Review; Rath A, Davidson AR. The design of a hyperstable mutant of the Abp1p SH3 domain
by sequence alignment analysis. Protein Sci. 2000 Dec; 9(12): 2457-69; Lehmann M, Kostrewa D,
Wyss M, Brugger R, D'Arcy A, Pasamontes L, van Loon AP. From DNA sequence to improved
functionality: using protein sequence comparisons to rapidly design a thermostable consensus
phytase. Protein Eng. 2000 Jan; 13(1): 49-57; Desjarlais JR, Berg JM. Use of a zinc-finger consensus
sequence framework and specificity rules to design specific DNA binding proteins. Proc Natl Acad Sci
U S A. 1993 Mar 15; 90(6): 2255-60; Desjarlais JR, Berg JM. Redesigning the DNA-binding specificity
of a zinc finger protein: a database-guided approach. Proteins. 1992 Feb; 12(2): 101-4; Henikoff S,
Henikoff JG. Amino acid substitution matrices. Adv Protein Chem. 2000; 54:73-97. Review. PMID:
10829225; Henikoff S, Henikoff JG. Position-based sequence weights. J Mol Biol. 1994 Nov 4;
243(4): 574-8. PMID: 7966282.

Similarly, structural alignment of structurally related proteins may be done to generate sequence
alignments. There are a wide variety of such structural alignment programs known. See for example
VAST from the NCBI (http://www.ncbi.nlm.nih.gov:80/Structure/VAST/vast.shtml); SSAP (Orengo and
Taylor, Methods Enzymol 266: 617-635 (1996)); SARF2 (Alexandrov, Protein Eng 9(9): 727-732
(1996)); CE (Shindyalov and Bourne, Protein Eng 11(9): 739-747 (1998)); (Orengo et al., Structure
5(8): 1093-108 (1997)); Dali (Holm et al., Nucleic Acid Res. 26(1): 316-9 (1998)), all of which are
incorporated by reference. These structurally-generated sequence alignments may then be
examined to determine the observed sequence variations.

In a preferred embodiment, residue pair potentials may be used to score sequences (Miyazawa et al.,
Macromolecules 18(3): 534-552 (1985); Jones, Protein Science 3: 567-574 (1994); PROSA (Hendlich
et al., J. Mol. Biol. 216: 167-180 (1990)); THREADER (Jones et al., Nature 358: 86-89 (1992)),
expressly incorporated by reference) during computational screening.

In a preferred embodiment, sequence profile scores (see Bowie et al., Science 253(5016): 164-70
(1991), incorporated by reference) and/or potentials of mean force (see Hendlich et al., J. Mol. Biol.
216(1): 167-180 (1990), also incorporated by reference) are calculated to score sequences.
Weighting using these methods determines the structural homology between the sequence and the
three-dimensional structure of a reference sequence. These methods assess the match between a
sequence and a three-dimensional protein structure and hence may act to screen sequences for
fidelity to the protein structure. In particular, U.S. Patent Nos. 6,269,312, 6,188,965, and 6,403,312,
herein expressly incorporated by reference, describe a method termed "Protein Design
Automation", or PDA™ technology, that utilizes a number of scoring functions to evaluate sequence
stability.

Primary libraries may be generated by predicting tertiary structure from sequence, and then selecting
sequences that are compatible with the predicted tertiary structure. There are a number of tertiary
structure prediction methods, including, but not limited to, threading (Bryant and Altschul, Curr Opin
Struct Biol 5(2): 236-244 (1995)); Profile 3D (Bowie, et al., Methods Enzymol 266: 598-616 (1996));
MONSSTER (Skolnick, et al., J Mol Biol 265(2): 217-241 (1997)); Rosetta (Simons, et al., Proteins
37(S3): 171-176 (1999)); PSI-BLAST (Altschul and Koonin, Trends Biochem Sci 23(11): 444-447
(1998)); Impala (Schaffer, et al., Bioinformatics 15(12): 1000-1011 (1999)); HMMER (McClure, et al.,
Proc Int Conf Intell Syst Mol Biol 4: 155-164 (1996)); Clustal W (http://www.ebi.ac.uk/clustalw/);
helix-coil transition theory (Munoz and Serrano, Biopolymers 41:495, 1997); neural networks; local
structure alignment; and others (e.g., see Selbig et al., Bioinformatics 15:1039, 1999).

In an alternate embodiment, the primary library consists of all sequences whose binary pattern, or
arrangement of hydrophobic and polar residues, is predicted to be compatible with formation of the
desired protein structure (Kamtekar et al., Science 262(5140): 1680-5 (1993)). In an alternate
embodiment, two profile methods (Gribskov et al. 84:4355-4358 (1987) and Fischer and
Eisenberg, Protein Sci. 5:947-955 (1996), Rice and Eisenberg J. Mol. Biol. 267:1026-1038 (1997), all
of which are expressly incorporated by reference) are used to generate the primary library.

In a further embodiment, a knowledge-based amino acid substitution matrix can be used to guide the
convergence of a protein design cycle. Examples of such matrices include, but are not limited to:
BLOSUM matrices (e.g. 62, 90, etc.), PAM matrices (e.g. 250, etc.), and Dayhoff matrices.

Energy Calculation Methods 

Force field calculations may be used to optimize the conformation of a sequence within a
computational method, such as molecular dynamics and rotamer placement methods, or to generate
de novo optimized sequences as outlined herein. These methods can be used in any step of the
methods of the invention, including their use to generate a primary or secondary library.

Force fields include, but are not limited to, ab initio or quantum mechanical force fields, semi-empirical
force fields, and molecular mechanics force fields. Examples of force fields include OPLS-AA
(Jorgensen, et al., J. Am. Chem. Soc. (1996), v 118, pp 11226-11236; Jorgensen, W.L.; BOSS,
Version 4.1; Yale University: New Haven, CT (1999)); OPLS (Jorgensen, et al., J. Am. Chem. Soc.
(1988), v 110, pp 1657ff; Jorgensen, et al., J Am. Chem. Soc. (1990), v 112, pp 4768ff); UNRES
(United Residue Forcefield; Liwo, et al., Protein Science (1993), v 2, pp 1697-1714; Liwo, et al.,
Protein Science (1993), v 2, pp 1716-1731; Liwo, et al., J. Comp. Chem. (1997), v 18, pp 849-873;
Liwo, et al., J. Comp. Chem. (1997), v 18, pp 874-884; Liwo, et al., J. Comp. Chem. (1998), v 19,
pp 259-276; Forcefield for Protein Structure Prediction (Liwo, et al., Proc. Natl. Acad. Sci. USA (1999),
v 96, pp 5482-5486)); ECEPP/3 (Liwo et al., J Protein Chem 1994 May; 13(4): 375-80); AMBER 1.1
force field (Weiner, et al., J. Am. Chem. Soc. v 106, pp 765-784); AMBER 3.0 force field (U.C. Singh et
al., Proc. Natl. Acad. Sci. USA 82:765-759); CHARMM and CHARMM22 (Brooks, et al., J. Comp.
Chem. v 4, pp 187-217); cvff3.0 (Dauber-Osguthorpe, et al., (1988) Proteins: Structure, Function and
Genetics, v 4, pp 31-47); cff91 (Maple, et al., J. Comp. Chem. v 16, 162-182); also, the DISCOVER (cvff
and cff91) and AMBER forcefields are used in the INSIGHT molecular modeling package
(Biosym/MSI, San Diego, California) and CHARMM is used in the QUANTA molecular modeling
package (Biosym/MSI, San Diego, California). HF, UHF, MCSCF, CI, MPx, MNDO, AM1, and MINDO
are techniques known to those skilled in the art and which may be used to perform computational site
directed mutagenesis for protein design (see Szabo et al., Modern Quantum Chemistry: Introduction
to Advanced Electronic Structure Theory, Macmillan, New York, (c1982) and Hehre, Ab Initio
Molecular Orbital Theory, Wiley, New York (c1986), all of which are expressly incorporated by
reference).

In a preferred embodiment, the scaffold protein is an enzyme and highly accurate electrostatic models
may be used for enzyme active site residue scoring to improve enzyme active site libraries (see
Warshel, Computer Modeling of Chemical Reactions in Enzymes and Solutions, Wiley & Sons, New
York, 1991, hereby expressly incorporated by reference). These accurate models may assess the
relative energies of sequences with high precision, but are computationally intensive. Highly accurate
electrostatic models may also be used in the design of binding sites.

Furthermore, scoring functions may be used to screen for sequences that would create metal or
co-factor binding sites in the protein (Hellinga, Fold Des. 3(1): R1-8 (1998), hereby expressly
incorporated by reference). Similarly, scoring functions may be used to screen for sequences that
would create disulfide bonds in the protein.

In a preferred embodiment, rotamer library selection methods are used to generate the primary
library (Dahiyat and Mayo, Protein Sci 5(5): 895-903 (1996); Dahiyat and Mayo, Science 278(5335):
82-7 (1997); Desjarlais and Handel, Protein Science 4: 2006-2018 (1995); Harbury et al., PNAS USA
92(18): 8408-8412 (1995); Kono et al., Proteins: Structure, Function and Genetics 19: 244-255
(1994); Hellinga and Richards, PNAS USA 91: 5803-5807 (1994)).

In a preferred embodiment, a sequence prediction algorithm (SPA) is used to design proteins that are
compatible with a known protein backbone structure as is described in Raha, K., et al. (2000) Protein
Sci., 9: 1106-1119; U.S.S.N. 09/877,695; USSN to be determined for a continuation-in-part
application filed on February 6, 2002, entitled APPARATUS AND METHOD FOR DESIGNING
PROTEINS AND PROTEIN LIBRARIES, with John R. Desjarlais as inventor, expressly incorporated
herein by reference.

In an alternate embodiment, other inverse folding methods such as those described by Simons et al.
(Proteins, 34:535-543, 1999), Levitt and Gerstein (PNAS USA, 95:5913-5920, 1998), Godzik et al.
(PNAS, v 89, pp 12098-102), Godzik and Skolnick (PNAS USA, 89:12098-102, 1992), and Godzik et al.
(J. Mol. Biol. 227:227-38, 1992) may be used.

In an alternate embodiment, molecular dynamics calculations may be used to computationally screen 
sequences by individually calculating mutant sequence scores and compiling a rank ordered list.

In addition, other computational methods such as those described by Koehl and Levitt (J. Mol. Biol. 
293:1161-1181 (1999); J. Mol. Biol. 293:1183-1193 (1999); expressly incorporated by reference) may 
be used to create a primary library. 

PDA™ Technology Calculations

In an especially preferred embodiment, the primary library is generated and processed as outlined in
U.S. Patent Nos. 6,269,312, 6,188,965, and 6,403,312, which are herein expressly incorporated by
reference. This processing step entails analyzing interactions of the rotamers with each other and
with the protein backbone to generate optimized protein sequences. Simplistically, the processing
initially comprises the use of a number of scoring functions to calculate energies of interactions of the
rotamers, with the backbone and with other rotamers. Preferred PDA™ technology scoring functions
include, but are not limited to, a van der Waals potential scoring function, a hydrogen bond potential
scoring function, an atomic solvation scoring function, a secondary structure propensity scoring
function and an electrostatic scoring function. As is further described below, at least one scoring
function is used to score each variable or floated position, although the scoring functions may differ
depending on the position classification or other considerations.

As will be appreciated by those skilled in the art, a variety of force fields may be used in the
PDA™ technology calculations. These include, but are not limited to, those listed previously. As
outlined in U.S. Patent Nos. 6,269,312, 6,188,965, and 6,403,312, which are herein expressly
incorporated by reference, any combination of the preferred scoring functions, either alone or in
combination, may be used. For example, in an alternate embodiment, rotamer internal energies are
included. In additional embodiments, energies or scores that are a function of the conformation
and/or identity of three or more amino acids are included.

In further embodiments, additional terms are included to influence the energy of each rotamer state,
including but not limited to, reference energies, pseudo energies based on rotamer statistics, and
sequence biases derived from multiple sequence alignments. Because sequence alignment
information and rational methods have demonstrated utility for protein optimization, the invention is an
improvement via its combination of information from both methods. Sequence alignment information
alone may sometimes be misleading because of unfavorable couplings between amino acids that
occur commonly in a multiple sequence alignment. Rational methods alone may have limitations; for
example, they are subject to systematic errors due to improper parameterization of force field
components and weights.

In a preferred embodiment, the scoring functions may be altered. Additional scoring functions may be
used. Additional scoring functions include, but are not limited to, torsional potentials, entropy
potentials, additional solvation models including contact models, solvent exclusion models (see
Lazaridis and Karplus, Proteins 35(2): 133-52 (1999)), and the like; and models for immunogenicity
(see U.S.S.N.s 09/903,378, 10/039,170, and PCT/US02/00165, herein expressly incorporated by
reference), such as functions derived from data on binding of peptides to MHC (Major
Histocompatibility Complex), that may be used to identify potentially immunogenic sequences. Such
additional scoring functions may be used alone, or as functions for processing the library after it is
initially scored.

Altered scoring functions may also be obtained from analysis of experimental data. For example, if
the presence of certain residues at certain positions is correlated with the presence of desired
protein properties, a scoring function may be generated which favors these certain residues.

In addition, other methods may be used to "train" scoring functions by comparing designed sequences
and their properties to natural sequences and their properties. That is, the relative importance, or
weight, given to individual scoring functions can be optimized in a variety of ways. Although a variety
of useful scoring functions exist that represent van der Waals, electrostatics, solvation, and other
terms, an important aspect of a force field is the contribution (or weight) of each scoring function to
the total score. In a preferred embodiment, computational sequence screening may be used to
identify force field parameters such that properties of natural proteins are mimicked in computationally

instances it may be desirable to include all sequences when a defined number of variable positions
are used. It is usually preferable for the primary library to be small enough that a reasonable fraction
of the sequence space of a particular sequence may be sampled, allowing for robust generation of
secondary libraries. Thus, primary libraries that range from about 50 to 10^^ are preferred, with from
1000 to 10^ being particularly preferred, and from 1000 to 100,000 being especially preferred. Thus,
in one preferred embodiment, the primary library excludes from 1% to about 90-95% of possible
sequence space sequences, with exclusion of at least 1%, 2%, 5%, 10%, 20%, 40%, 50% and 70%
being preferred. Alternatively, the library may include 1 in 10^ 1 in 10^ 1 in 10^°, 1 in 10^^ 1 in 10^°,
1 in 10^® and 1 in 10^°.

A variety of approaches may be used to select a set of sequences for the primary library, including
structure-based methods such as PDA™ technology, sequence-based methods, or combinations as
outlined herein. In addition, as noted herein, any method used to generate a primary or secondary
library may be used as the other step.

It should also be noted that while these methods are described in conjunction with limiting the size of 
the primary library, these same techniques may be used to formulate a cutoff for inclusion in the 
secondary and tertiary libraries as well. 

The set of protein sequences in the primary and secondary libraries is generally, but not always, 
significantly different from the wild-type sequence from which the backbone was taken, although in 
some cases the primary or secondary library may contain the wild-type sequence. That is, the range 
of optimized protein sequences is dependent upon many factors including the size of the protein, 
properties desired, etc. However, for example, a sequence in the library may comprise between 
0.001% and 100% variant amino acids, with at least about 90%, 70%, 50%, 30%, or 10% variant 
amino acids being preferred. 

In a preferred embodiment, the primary library sequences are obtained from a rank-ordered list or 
filtered set generated using an algorithm such as Monte Carlo, B&B, or SCMF. For example, the top 
10^ or the top 10^ sequences in the rank-ordered list or filtered set may comprise the primary library. 
Alternatively, all sequences scoring within a certain range of the optimum sequence may be used. 
For example, all sequences within 10 kcal/mol of the optimum sequence could be used as the 
primary library. In addition, as outlined below, any cut of a rank-ordered list or a filtered set may be 
used depending on the conditions, use and additional methodologies of the resulting set; for example, 
the top X number of sequences may be used, or the top X and the bottom Y number of sequences, 
for example when a wider range of sequence space is to be explored or when clustering is used. This 
method has the advantage of using a direct measure of fidelity to a three-dimensional structure to 
determine inclusion. 

Alternatively, the total number of sequences defined by the recombination of all mutations may be 
used as a cutoff criterion for the primary sequence library. Preferred values for the total number of 
recombined sequences range from 100 to 10^; particularly preferred values range from 1000 to 10"; 
especially preferred values range from 1000 to 10'. Alternatively, a cutoff may be enforced when a 
predetermined number of mutations per position is reached. As a rank-ordered (or unordered) or 
filtered set sequence list is lengthened and the library is enlarged, the number of mutations per 
position will typically increase. Alternatively, the first occurrence in the list of predefined undesirable 
residues may be used as a cutoff criterion. For example, the first hydrophilic residue occurring in a 
core position could limit the set of sequences included in the primary library. Alternatively, when 
multiple related structures are used for the scaffold, the set of optimal sequences for each structure 
may be used to make the primary library. 

In addition, in some embodiments, sequences that do not make the cutoff are included in the primary 
library. This may be desirable in some situations, for instance to evaluate the primary library 
generation method, to serve as controls or comparisons, or to sample additional sequence space. 
For example, in a preferred embodiment, the wild-type sequence is included, even if it did not make 
the cutoff. 
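
By way of illustration only, the energy-window cutoff and the "top X and bottom Y" cut described above might be sketched in Python as follows. The function names, the (sequence, energy) input format, and the convention that lower energies are better are assumptions made for this sketch and are not taken from the PDA™ technology itself.

```python
from typing import List, Optional, Tuple


def select_primary_library(
    scored: List[Tuple[str, float]],
    energy_window: float = 10.0,
    wild_type: Optional[str] = None,
) -> List[str]:
    """Keep every sequence scoring within energy_window of the optimum."""
    # Rank from lowest (assumed best) to highest energy.
    ranked = sorted(scored, key=lambda pair: pair[1])
    best_energy = ranked[0][1]
    library = [seq for seq, energy in ranked if energy - best_energy <= energy_window]
    # Optionally include the wild-type sequence even if it missed the cutoff.
    if wild_type is not None and wild_type not in library:
        library.append(wild_type)
    return library


def top_and_bottom(ranked: List[str], x: int, y: int) -> List[str]:
    """Alternative cut: the top X plus the bottom Y of a rank-ordered list."""
    if x + y >= len(ranked):
        return list(ranked)
    return ranked[:x] + (ranked[-y:] if y > 0 else [])
```

For instance, with hypothetical energies in kcal/mol, a call such as select_primary_library(scored, energy_window=10.0, wild_type="GVL") would return the sequences within 10 kcal/mol of the optimum and then append the wild-type sequence if it was excluded.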

As is further outlined below, it should also be noted that different primary libraries may be combined. 
For example, positions in a protein that show a great deal of mutational diversity in computational 
screening may be fixed as outlined below and a different primary library regenerated. A rank-ordered 
list or filtered set of the same length as the first would now show diversity at positions that were 
largely conserved in the first library. The variants from a first primary library may be combined with 
the variants from a second primary library to provide a combined library at lower computational cost 
than creating a very long rank-ordered list or filtered set. This approach may be particularly useful to 
sample sequence diversity in both highly mutatable and highly conserved positions. In addition, 
primary libraries may be generated by combining the results of two or more calculations to form one 
primary library. 

CLUSTERING 

Clustering algorithms may be useful for classifying sequences derived by protein design algorithms 
into representative groups. Clustering can serve a wide variety of purposes. For example, sets of 
sequences that are close in sequence space can be distinguished from other sets, and thus 
recombination can be confined within sets. That is, sequences that share a local minimum may be 
recombined, to allow better results, rather than recombining sequences from two local minima that may 
have quite different sequences. Thus, for example, a primary library can be clustered around local 
minima ("clustered sets of sequences"), recombination or secondary library generation is performed 
within each clustered set, and then each "clustered" secondary library is added to form the secondary 
library genus. 

Clustering algorithms require two key components. First is a metric for comparing the similarity of two 
entities. Measures of similarity include, but are not limited to, sequence identity, sequence similarity, 
and energetic similarity. Second, clustering algorithms require an algorithm to separate the entities 
into groups based on relative similarities. Many types of clustering algorithms exist; the simplest 
and most commonly used are single-linkage, complete-linkage, and average-linkage methods (see 
Figure 5). These are often applied hierarchically, such that the relationships between entities may be 
described with a tree structure. 

Preferably, clustering algorithms including, but not limited to, single linkage clustering algorithms, 
complete linkage clustering algorithms, and average linkage clustering algorithms are used to analyze 
the results from computational protein cycles described herein. Clustering algorithms may be used to 
form subsets using computationally generated energy matrices to measure energetic similarity (see 
Figure 6). Alternatively, clustering algorithms may be used to form subsets directly from a set of 
optimized protein sequences. 

In a preferred embodiment, a single-linkage clustering algorithm is used to form subsets from 
computationally generated energy matrices. An example of the use of a single-linkage clustering 
algorithm to form subsets from a computationally generated energy matrix is shown in Figures 5, 6, 
and 7. 

In alternative embodiments, a single linkage clustering algorithm is used to form subsets directly from 
a set of optimized protein sequences whereby the measure of similarity between two sequences is 
the extent of sequence identity. Alternatively, the measure of similarity between two sequences may 
be based on a standard sequence similarity comparison. As will be appreciated by those skilled in 
the art, similarity scores include but are not limited to BLOSUM similarity score, Dayhoff similarity 
score, PAM similarity score, etc. Specific examples of the aforementioned similarity scores include but 
are not limited to BLOSUM tables 62 and 90 and PAM table 260, among others. In a preferred 
embodiment, subsets of designed protein sequences derived by clustering or related methods may be 
used to define multiple primary or secondary libraries. 
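
As a minimal sketch of the single linkage approach described above, assuming that all sequences share one scaffold and therefore have equal length, the following Python code groups sequences whose fractional sequence identity meets a chosen threshold. Cutting a single-linkage hierarchy at a fixed similarity level is equivalent to finding connected components of the "similar enough" graph, which is what the union-find structure below computes. The function names and the threshold value are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def single_linkage_clusters(seqs: List[str], min_identity: float) -> List[List[str]]:
    """Group sequences linked, directly or through a chain, by >= min_identity."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair that is similar enough.
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= min_identity:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[str]] = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())
```

Note the chaining behavior characteristic of single linkage: two sequences may fall in the same cluster although their direct identity is below the threshold, provided that a chain of intermediate sequences links them. A similarity score drawn from a substitution table could be substituted for the identity function without changing the rest of the sketch.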

In an alternate embodiment, sets of sequences that may be recombined productively are defined as 
those that minimize disruption of sets of interacting or correlated residues. Identification of sets of 
interacting residues may be carried out in a number of ways, e.g. by using known pattern recognition 
methods, comparing frequencies of occurrence of mutations or by analyzing the calculated energy of 
interaction among the residues (for example, if the energy of interaction is high, the positions are said 
to be correlated or interacting). These correlations may be positional correlations (e.g. variable 
residue positions 1 and 2 always change together or never change together) or sequence correlations 
(e.g. if there is a residue A at position 1, there is always residue D at position 2). In addition, 
programs used to search for consensus motifs may be used. See: Lockless and Ranganathan, 
Science 286:295-299 (1999); Pattern Discovery in Biomolecular Data: Tools, Techniques, and 
Applications, edited by Jason T.L. Wang, Bruce A. Shapiro, Dennis Shasha, New York: Oxford 
University, 1999; Andrews, Harry C., Introduction to Mathematical Techniques in Pattern Recognition, 
New York, Wiley-Interscience (1972); Applications of Pattern Recognition, editor K.S. Fu, Boca 
Raton, Fla.: CRC Press, 1982; Genetic Algorithms for Pattern Recognition, edited by Sankar K. Pal, 
Paul P. Wang, Boca Raton: CRC Press, c1996; Pandya, Abhijit S., Pattern Recognition with Neural 
Networks in C++ / Abhijit S. Pandya, Robert B. Macy, Boca Raton, Fla.: CRC Press, 1996; Handbook 
of Pattern Recognition and Computer Vision / edited by C.H. Chen, L.F. Pau, P.S.P. Wang, 2nd ed., 
Singapore; River Edge, N.J.: World Scientific, c1999; and Friedman, Introduction to Pattern 
Recognition: Statistical, Structural, Neural, and Fuzzy Logic Approaches, River Edge, N.J.: World 
Scientific, c1999, Series Title: Series in Machine Perception and Artificial Intelligence; vol. 32. All 
references cited herein are expressly incorporated by reference. 

GENERATION OF SECONDARY LIBRARIES 

As described herein, there are a wide variety of methods to generate secondary libraries from primary 
libraries. The first is a selection step, where some set of primary sequences is chosen to form the 
secondary library. The second is a computational step, again generally including a selection step 
where some subset of the primary library is chosen and then subjected to further computational 
analysis, including both protein design cycles as well as techniques such as "in silico" shuffling 
(recombination). The third is an experimental step, where some subset of the primary library is 
chosen and then recombined experimentally to form a secondary library. 

SELECTING SEQUENCES FOR THE SECONDARY LIBRARY 

In a preferred embodiment, the primary library of the scaffold protein is used to generate a secondary 
library. The secondary library may then be generated and tested experimentally or subjected to 
further computational manipulation. A variety of approaches, including but not limited to those 
described below, may be used to select sequences for the secondary library. Each approach may be 
used alone, or any combination of approaches may be used. As will be appreciated by those in the 
art, the secondary library may be either a subset of the primary library, or contain new library 
members, i.e. sequences that are not found in the primary library. That is, in general, the variant 
positions and/or amino acid residues in the variant positions may be recombined in any number of 
ways to form a new library that exploits the sequence variations found in the primary library. In such 
embodiments, the secondary library will contain sequences that were not included in the primary 
library. In all cases, if the secondary library is generated experimentally, it may optionally comprise 
one or more "error" sequences, which result from experimental errors, as well as one or more 
sequences generated intentionally. That is, additional variability can be added to the secondary library 
(or, in fact, to the primary library as well), either experimentally (e.g. through the use of error-prone 
PCR in secondary library sequences) or computationally (adding an "in silico" variant generation step 
to sample more sequence space). In the latter case, it is possible to introduce this additional level of 
variability in a random fashion (as used herein random includes variation introduced in a controlled 
manner or an uncontrolled manner) or in a directed fashion. For example, directed variability may be 
introduced by adding certain residues from a particular sequence, e.g. the human sequence. 

Selecting a subset of the primary library 

As described herein, there are a wide variety of techniques that can be used to generate a secondary 
library. In a preferred embodiment, a subset of the primary library is used as the secondary library. 
This subset can be chosen in a variety of ways, as outlined herein. For example, similar to the 
primary library cut-off, an arbitrary numerical cut-off can be applied: the top X number of sequences 
forms the basis of the secondary library (or the top X number and the bottom Y number, or any 
sequences in the top X number plus anything within Z energy of the wild-type sequence, etc.). As will 
be appreciated by those in the art, there are a wide variety of relatively simple numerical cutoffs that 
can be applied. 

In a preferred embodiment, all amino acid residues are allowed at each variable residue position 
identified in the primary library. That is, once the variable residue positions are identified, a 
secondary library comprising every combination of every amino acid at each variable residue position 
is made. 
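
The exhaustive combination described above can be sketched as follows, offered as an illustration under stated assumptions rather than as the method of the invention. The scaffold is assumed to be a one-letter amino acid string and the variable positions zero-based indices; the names are hypothetical. Because the number of members grows as 20 raised to the number of variable positions, the sketch yields sequences lazily instead of building the whole list.

```python
from itertools import product
from typing import Iterator, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def combinatorial_library(
    scaffold: str,
    variable_positions: Sequence[int],
    allowed: str = AMINO_ACIDS,
) -> Iterator[str]:
    """Yield every combination of allowed residues at the variable positions."""
    for combo in product(allowed, repeat=len(variable_positions)):
        residues = list(scaffold)
        for position, amino_acid in zip(variable_positions, combo):
            residues[position] = amino_acid
        yield "".join(residues)
```

With two variable positions and all 20 residues allowed, the generator produces 400 sequences; restricting the allowed argument to a subset of residues gives the smaller libraries discussed in the following paragraphs.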

In a preferred embodiment, subsets of amino acids are chosen to maximize coverage. Additional 
amino acids with properties similar to those contained within the primary library may be manually 
added. For example, if the primary library includes three large hydrophobic residues at a given 
position, the user may choose to include additional large hydrophobic residues at that position when 
generating the secondary library. In addition, amino acids in the primary library that do not share 
similar properties with most of the amino acids at a given position may be excluded from the 
secondary library. Alternatively, subsets of amino acids may be chosen from the primary library such 
that a maximal diversity of side chain properties is sampled at each position. For example, if the 
primary library includes three large hydrophobic residues at a given position, the user may choose to 
include only one of them in the secondary library, in combination with other amino acids that are not 
large and hydrophobic. 

In a preferred embodiment, the primary library may be analyzed to determine which amino acid 
positions in the scaffold protein have a high mutational frequency, and which positions have a low 
mutational frequency. The secondary library may be generated by varying the amino acids at the 
positions that have high numbers of mutations, while keeping constant the positions that do not have 
mutations above a certain frequency. For example, if a position has less than 20% and more 
preferably less than 10% mutations, it may be held invariant. 
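
A minimal sketch of this frequency analysis is given below, assuming that a "mutation" at a position means any residue differing from the wild-type residue at that position and that all sequences have the same length as the wild type. The function name and the default threshold are illustrative assumptions.

```python
from typing import List


def variable_positions(
    library: List[str],
    wild_type: str,
    min_fraction: float = 0.10,
) -> List[int]:
    """Return zero-based positions mutated in at least min_fraction of the library.

    Positions falling below the threshold would be held invariant.
    """
    total = len(library)
    positions = []
    for index, wild_type_residue in enumerate(wild_type):
        mutated = sum(1 for seq in library if seq[index] != wild_type_residue)
        if mutated / total >= min_fraction:
            positions.append(index)
    return positions
```

The positions returned are those that would be varied in the secondary library; all others keep the wild-type residue.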

In a preferred embodiment, the secondary library is generated from a probability distribution table. As 
outlined herein, there are a variety of methods of generating a probability distribution table, including 
using PDA™ technology output, the results of other energy calculation methods (e.g. SCMF), and/or 
the results of knowledge- or sequence-based methods, all described previously. In addition, the 
probability distribution may be used to generate information entropy scores for each position, as a 
measure of the mutational frequency observed in the library. In this embodiment, the frequency of 
each amino acid residue at each variable residue position in the list is identified. Frequencies may be 
thresholded, wherein any variant frequency lower than a cutoff is set to zero. This cutoff is preferably 
1%, 2%, 5%, 10% or 20%, with 10% being particularly preferred. These frequencies may be built into 
the secondary library, so that the frequency at which each amino acid is present in the primary library 
is equal, within experimental error, to the frequency at which that amino acid will be present in the 
secondary library. 
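
The thresholded probability table and the per-position information entropy described above might be computed as in the following sketch. Two points are assumptions of this illustration and are not stated in the text: the surviving frequencies are renormalized to sum to one after thresholding, and the entropy is the Shannon entropy in bits. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def probability_table(library: List[str], cutoff: float = 0.10) -> List[Dict[str, float]]:
    """Per-position amino acid frequencies, with rare variants set to zero (dropped)."""
    table = []
    for index in range(len(library[0])):
        counts = Counter(seq[index] for seq in library)
        total = sum(counts.values())
        # Threshold: discard any residue whose frequency is below the cutoff.
        kept = {aa: n / total for aa, n in counts.items() if n / total >= cutoff}
        # Assumption: renormalize so the retained frequencies sum to one.
        norm = sum(kept.values())
        table.append({aa: freq / norm for aa, freq in kept.items()})
    return table


def position_entropy(distribution: Dict[str, float]) -> float:
    """Shannon entropy in bits; zero for a fully conserved position."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)
```

A position with entropy near zero is effectively conserved, whereas higher values indicate positions at which the library samples several residues at comparable frequencies.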

Recombination of Some or All Primary Library Sequences to Generate a Secondary Library 

In an alternate embodiment, variable residue positions may be recombined to generate novel 
sequences to form a secondary library. Thus, the secondary library comprises at least one member 
sequence and preferably a plurality of such member sequences not found in the primary library. 
Recombination may be performed experimentally and/or computationally using a variety of 
approaches. For example, a list of naturally occurring sequences may be used to calculate all 
possible recombinant sequences, with an optional rank ordering or filtering step. Alternatively, once a 
primary library is generated, one could rank order only those recombinations that occur at cross-over 
points with at least a threshold of identity over a given window (for example, 100% identity over a 
contiguous 18 nucleotide sequence, or 80% identity over a contiguous 24 nucleotide sequence). 
Alternatively, the homology could be considered at the DNA level, by computationally translating the 
amino acids to their respective DNA codons. Different codon usages could be considered. A 
preferred embodiment considers only recombinations with crossover points that have DNA sequence 
identity sufficient for hybridization. 
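
The identity-window criterion for crossover points can be illustrated with the short sketch below. It assumes two pre-aligned DNA sequences of equal length with no gaps, and it places the crossover at the midpoint of a qualifying window; both choices, like the function names, are assumptions made for illustration.

```python
from typing import List


def crossover_points(
    dna_a: str,
    dna_b: str,
    window: int = 18,
    min_identity: float = 1.0,
) -> List[int]:
    """Return start indices of windows whose identity meets the threshold."""
    if len(dna_a) != len(dna_b):
        raise ValueError("sequences must be aligned to the same length")
    starts = []
    for start in range(len(dna_a) - window + 1):
        segment_a = dna_a[start:start + window]
        segment_b = dna_b[start:start + window]
        matches = sum(x == y for x, y in zip(segment_a, segment_b))
        if matches / window >= min_identity:
            starts.append(start)
    return starts


def recombine(dna_a: str, dna_b: str, start: int, window: int = 18) -> str:
    """Join the 5' part of dna_a to the 3' part of dna_b inside the window."""
    cut = start + window // 2
    return dna_a[:cut] + dna_b[cut:]
```

Calling crossover_points with window=24 and min_identity=0.8 corresponds to the second example in the text. Only recombinants produced at the returned positions would then be passed to the rank ordering or filtering step.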

In some embodiments, all possible recombinant sequences are experimentally generated and tested. 
Alternatively, in a preferred embodiment, the recombinant sequences are scored computationally and 
a subset of these sequences is experimentally generated and tested. Computational screening of 
the set of recombinant sequences may be used to reduce the library to an experimentally tractable 
size and/or to enrich the library in sequences predicted to possess desired properties. The 
recombinant sequences may be analyzed using methods including, but not restricted to, those 
methods used to generate and analyze primary library sequences, and by considering the role of 
clusters of interacting residues, as discussed below. 

In a preferred embodiment, the secondary library is generated by using any of the techniques outlined 
for primary library generation (SPA, PDA™, taboo, clustering, "in silico" recombination, etc.) on the 
primary library that has been chosen. Particular combinations of computational analyses for primary 
and secondary libraries are outlined below. 

In a preferred embodiment, the secondary library is generated experimentally, using any number of 
the techniques outlined below, including gene assembly procedures. 

It is possible that some recombinant sequences will be inviable, that is, they will fail to fold, aggregate, 
possess other undesired properties, or lack desired properties. In certain cases, some algorithms 
will generate a plurality of local minima, the combination of which may lead to unsatisfactory 
sequences. 

However, computational screening approaches may be used to differentiate viable constructs from 
inviable constructs and to bias or select for the former. For example, if recombining all library 
members is predicted to yield an excessive number of inviable sequences, subsets of a library could 
be recombined instead. Strategies for identifying sets of sequences that may be productively 
recombined include, but are not limited to, clustering based on sequence identity or similarity, 
clustering based on similarity of the energy matrix, and identification of sets of interacting residues. 

As will be appreciated by those in the art and outlined herein, probability distribution tables can be 
generated in a variety of ways. In addition to the methods outlined herein, self-consistent mean field 
(SCMF) methods can be used in the direct generation of probability tables. SCMF is a deterministic 
computational method that uses a mean field description of rotamer interactions to calculate energies. 
A probability table generated in this way can be used to create secondary libraries as described 
herein. SCMF can be used in three ways: the frequencies of amino acids and rotamers for each 
amino acid are listed at each position; the probabilities are determined directly from SCMF (see 
Delarue et al., Pac. Symp. Biocomput. 109-21 (1997), expressly incorporated by reference). In 
addition, highly variable positions and non-variable positions can be identified. Alternatively, another 
method is used to determine what sequence is jumped to during a search of sequence space; SCMF 
is used to obtain an accurate energy for that sequence; this energy is then used to rank it and create 
a rank-ordered list of sequences (similar to a Monte Carlo sequence list). A probability table showing 
the frequencies of amino acids at each position can then be calculated from this list (Koehl et al., J. 
Mol. Biol. 239:249 (1994); Koehl et al., Nat. Struc. Biol. 2:163 (1995); Koehl et al., Curr. Opin. Struct. 
Biol. 6:222 (1996); Koehl et al., J. Mol. Biol. 293:1183 (1999); Koehl et al., J. Mol. Biol. 293:1161 
(1999); Lee, J. Mol. Biol. 236:918 (1994); and Vasquez, Biopolymers 36:53-70 (1995); all of which are 
expressly incorporated by reference). Other forcefields that can be used in similar methods are 
outlined above. 

In addition, as outlined herein, a preferred method of generating a probability distribution table is 
through the use of sequence alignment programs. In addition, the probability table can be obtained 
by a combination of sequence alignments and computational approaches. For example, one can add 
amino acids found in the alignment of homologous sequences to the result of the computation. 
Preferably, one can add the wild-type amino acid identity to the probability table if it is not found in the 
computation. 

Generation of Tertiary Libraries 

In a preferred embodiment, a variety of additional steps may be done to one or more secondary 
libraries; for example, further computational processing may occur, secondary libraries may be 
recombined, or subsets of different secondary libraries may be combined. 

In a preferred embodiment, a tertiary library can be generated from combining secondary libraries. 
For example, a probability distribution table from a secondary library can be generated and 
recombined, whether computationally or experimentally, as outlined herein. A PDA secondary library 
may be combined with a sequence alignment secondary library, and either recombined (again, 
computationally or experimentally) or just the cutoffs from each joined to make a new tertiary library. 
The top sequences from several libraries can be recombined. Primary and secondary libraries can 
similarly be combined. Sequences from the top of a library can be combined with sequences from the 
bottom of the library to more broadly sample sequence space, or only sequences distant from the top 
of the library can be combined. Primary and/or secondary libraries that analyzed different parts of a 
protein can be combined to a tertiary library that treats the combined parts of the protein. These 
combinations can be done to analyze large proteins, especially large multidomain proteins or 
complete proteosomes. 

In a preferred embodiment, a tertiary library can be generated using correlations in the secondary 
library. That is, a residue at a first variable position may be correlated to a residue at a second variable 
position (or correlated to residues at additional positions as well). For example, two variable positions 
may sterically or electrostatically interact, such that if the first residue is X, the second residue must 
be Y. This may be either a positive or negative correlation. This correlation, or "cluster" of residues, 
may be both detected and used in a variety of ways. (For the generation of correlations, see the 
earlier cited art). 
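
One simple way to detect strict sequence correlations of the "if the first residue is X, the second residue must be Y" kind is sketched below. This is a counting heuristic assumed for illustration and is not one of the pattern recognition methods of the cited art; the function name and the min_support parameter (the minimum number of library members that must carry residue X) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import List, Tuple


def strict_correlations(
    library: List[str],
    min_support: int = 2,
) -> List[Tuple[int, str, int, str]]:
    """Find (i, x, j, y) such that every sequence with x at i has y at j."""
    length = len(library[0])
    found = []
    for i in range(length):
        for j in range(length):
            # Skip self-pairs and partner positions that never vary.
            if i == j or len({seq[j] for seq in library}) < 2:
                continue
            partners = defaultdict(set)
            support = Counter()
            for seq in library:
                partners[seq[i]].add(seq[j])
                support[seq[i]] += 1
            if len(partners) < 2:
                continue  # position i is invariant, so nothing to correlate
            for x, partner_residues in partners.items():
                if len(partner_residues) == 1 and support[x] >= min_support:
                    found.append((i, x, j, next(iter(partner_residues))))
    return found
```

Pairs detected in this way could then be kept together when residues are recombined, so that a tertiary library does not separate residues that always occur together in the secondary library.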

In addition, primary and secondary libraries can be combined to form new libraries; these can be 
random combinations of the libraries, combining the "top" sequences, or weighting the combinations 
(positions or residues from the first library are scored higher than those of the second library). 

Additional variability can be added to the tertiary library as well, either experimentally (e.g. through 
the use of error-prone PCR in tertiary library sequences) or computationally (adding an "in silico" 
variant generation step to sample more sequence space). In the latter case, it is possible to introduce 
this additional level of variability in a random fashion (as used herein random includes variation 
introduced in a controlled manner or an uncontrolled manner) or in a directed fashion. For example, 
directed variability may be introduced by adding certain residues from a particular sequence, e.g. the 
human sequence. 

In a preferred embodiment, when two computational steps are used (e.g. a PDA™ step to generate a 
primary library and in silico shuffling or a probability table to generate a secondary library), the 
experimental generation of the secondary library can result in a tertiary library, that is, a library that 
contains members not found in the secondary library. Alternatively, the tertiary library may just be a 
subset of the secondary library as outlined above. 

In a preferred embodiment, a secondary library may be computationally remanipulated to form an 
additional secondary library (sometimes referred to herein as a "tertiary library"). For example, any of 
the secondary library sequences may be chosen for a second round of PDA™ technology 
calculations, by freezing or fixing some or all of the changed positions in the first secondary library. 
Alternatively, only changes seen in the last probability distribution table would be allowed. 
Alternatively, the stringency of the probability table may be altered, either by increasing or decreasing 
the cutoff for inclusion. 

In a preferred embodiment, the sequence information derived from experimental screening of a 
secondary library could be used to guide the design for the tertiary library. In this way, the library 
generation is an iterative process. In a preferred embodiment, the tertiary library could be derived by 
computationally screening the secondary library for desired protein properties as previously 
mentioned. 

Experimentally Making the Library 

Once a library is generated using any of the methods outlined herein or combinations thereof, the 
library (or a tertiary, quaternary, etc. library) is made using any number of techniques, including gene 
assembly procedures. Accordingly, the present invention provides methods for making protein 
libraries in any of a variety of different ways. 

Chemical synthesis of proteins 

In a preferred embodiment, different protein members of the secondary library may be chemically 
synthesized. This is particularly useful when the designed proteins are short, preferably less than 150 
amino acids in length, with less than 100 amino acids being preferred, and less than 50 amino acids 
being particularly preferred, although as is known in the art, longer proteins may be made chemically 
or enzymatically. 

These amino acid sequences could then be joined together via chemical ligation to form larger 
proteins as needed (see Yan, L. and Dawson, P.E., J. Am. Chem. Soc. 123 (2001) 526-533, and 
Dawson, P.E. and Kent, S.B.H., Ann. Rev. Biochem. 69 (2000) 923-960, hereby expressly 
incorporated by reference). Furthermore, peptides corresponding to sequences from different library 
members could be shuffled or randomly ligated together to form a secondary library. For example, 
one or more peptides with different amino acid sequences from the N-terminal region of the protein 
could be ligated to one or more peptides with different amino acid sequences from the C-terminal 
region of the protein. Such an assembly could be repeated for several further rounds of synthesis. 
Using such a method, a secondary library could be chemically synthesized. 

In a preferred embodiment, proteins could be constructed by chemical synthesis of peptides and 
formed by ligation of the peptides using intein technology (Evans et al. (1999) J. Biol. Chem. 274, 
18359-18363; Evans et al. (1999) J. Biol. Chem. 274, 3923-3926; Mathys et al. (1999) Gene 231, 
1-13; Evans et al. (1998) Protein Sci. 7, 2255-2264; Southworth et al. Biotechniques 27, 110-120). 

Generating nucleic acids that encode single members of a library 

In a preferred embodiment, particularly for longer proteins or proteins for which large samples are 
desired, the secondary library sequences are used to create nucleic acids such as DNA which encode 
the member sequences and which may then be cloned into host cells, expressed and assayed, if 
desired. Thus, nucleic acids, and particularly DNA, may be made which encode each member 
protein sequence. This is done using well-known procedures. See Maniatis and current protocols 
(see Current Protocols in Molecular Biology, Wiley & Sons, and Molecular Cloning - A Laboratory 
Manual - 3rd Ed., Cold Spring Harbor Laboratory Press, New York (2001)). The choice of codons, 
suitable expression vectors and suitable host cells will vary depending on a number of factors, and 
may be easily optimized as needed. 



Gene Assembly Procedures 

As will be appreciated by those in the art, the generation of exact sequences for a library comprising a 
large number of sequences (despite the fact that the set number is much smaller than the original set) 
is still potentially expensive and time consuming. Accordingly, in a preferred embodiment, there are a 
variety of gene assembly techniques that may be used to generate the secondary or higher order 
libraries of the present invention. As discussed herein, these experimentally generated libraries 
generally recombine sequences within the library, resulting in sequences present in the original library 
as well as recombined combinations of those sequences. 

Gene Assembly Using Pooled Oligonucleotides 

In a preferred embodiment, multiple amplification reactions with pooled oligonucleotides are done, as 
is generally depicted in Figure 12. This generally involves generating variant protein sequences 
created by the assembly of gene fragments generated from a nucleic acid template. The 
oligonucleotides can be full-length "overlapping" oligonucleotides, or primers. In one embodiment, 
overlapping oligonucleotides are synthesized which correspond to the full-length gene. As may be 
appreciated by one skilled in the art, these oligonucleotides may represent all of the different amino 
acids at each variant position or subsets. Once these oligonucleotides are made, they are 
reassembled into a set of variable sequences in any number of ways, outlined below. While the 
reactions described below focus on PCR as the amplification technique, others are included as is 
generally outlined below. 
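
To illustrate what is meant by overlapping oligonucleotides that correspond to the full-length gene, the following sketch tiles a gene sequence into fragments that each share a fixed overlap with the next. It is a simplification under stated assumptions: only the sense strand is tiled (a practical design would typically alternate sense and antisense strands), and the oligonucleotide length and overlap are arbitrary example values not taken from the text.

```python
from typing import List


def overlapping_oligos(gene: str, oligo_len: int = 40, overlap: int = 20) -> List[str]:
    """Tile a gene into oligos in which each shares `overlap` bases with the next."""
    if not 0 <= overlap < oligo_len:
        raise ValueError("overlap must be non-negative and shorter than oligo_len")
    step = oligo_len - overlap
    oligos = []
    start = 0
    while True:
        oligos.append(gene[start:start + oligo_len])
        if start + oligo_len >= len(gene):
            break  # the last oligo reaches the end of the gene
        start += step
    return oligos


def reassemble(oligos: List[str], overlap: int = 20) -> str:
    """Rebuild the gene by appending the non-overlapping tail of each oligo."""
    return oligos[0] + "".join(oligo[overlap:] for oligo in oligos[1:])
```

In a pooled design, the oligonucleotide covering a variant position would be synthesized in several versions, one for each amino acid to be represented at that position.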

In general, the invention may take on a wide variety of configurations. For example, libraries of
nucleic acids encoding all or a subset of possible proteins are generated by assembling nucleic acid
fragments. Preferably, the gene fragments are linked together using an enzymatic or non-enzymatic
method for the ligation of gene fragments. For example, for each gene fragment, a pair of donor
fragments is generated such that the sense strand from one donor fragment complements the
antisense strand of the other donor fragment and creates a 5'-phosphorylated overhang when the two
strands are hybridized under conditions that allow for the formation of a double stranded molecule.
The 5'-phosphorylated overhang is located at one of the 5' ends of the resulting double stranded
molecule to allow ligation to a free 3'-terminus of an adjacent gene fragment. In some embodiments,
5'-phosphorylated overhangs are generated at both ends, preferably with unique sequences to
prevent self-ligation.
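
As an illustration of the preference for unique overhang sequences, the following is a minimal Python
sketch, not a method taken from this specification. The function names and the example overhangs are
hypothetical. It flags an overhang that is self-complementary (a palindromic overhang could anneal to a
second copy of itself) or that duplicates, or is the reverse complement of, an overhang already used at
another junction.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_self_complementary(overhang):
    """True if the overhang could anneal to a second copy of itself."""
    return overhang.upper() == reverse_complement(overhang)


def check_overhangs(overhangs):
    """Return the overhangs that risk self-ligation or mis-ligation.

    An overhang is reported if it is palindromic, or if it repeats (or is
    the reverse complement of) an overhang seen earlier in the list.
    """
    problems = []
    seen = set()
    for oh in (o.upper() for o in overhangs):
        if is_self_complementary(oh) or oh in seen or reverse_complement(oh) in seen:
            problems.append(oh)
        seen.add(oh)
    return problems


# Hypothetical junction overhangs for a four-fragment assembly.
print(check_overhangs(["AATG", "GATC", "CATT", "AGGT"]))
```

In this example, GATC is reported because it is its own reverse complement, and CATT is reported
because it is the reverse complement of AATG.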

Chemically synthesized oligonucleotides are used as primers for the generation of donor fragments.
For each pair of donor fragments, one primer is labeled at the 5'-end with a purification tag. The
purification tag may be a his, myc, flag, or HA tag, or a fusion protein may be used instead, for
example gst, thioredoxin, or nusA, among others known in the art. Preferably, the purification tag is
biotin. The other primer is designed to bind to the other member of the donor fragment pair to create
a 5'-phosphorylated overhang, from about 1 to 20 or more base pairs in length.

In a preferred embodiment, at least one of the populations of nucleic acid fragments comprises variant
sequences that result in the formation of a variant nucleic acid sequence. In a further embodiment,
both the 5'-phosphorylated primer and at least one of the populations of nucleic acid fragments are
used to generate variant nucleic acid sequences. In a preferred embodiment, ligation substrates are
formed from at least two different donor fragment pairs. The donor fragment pairs may be generated
from the same template or from different templates.

In a preferred embodiment, the ligation product is generated using the following steps: (1) generating
at least two donor fragments from a template molecule using primer dependent DNA polymerization
wherein one strand comprises a purification tag and the other strand comprises a 5'-phosphorylated
overhang; (2) removing strands tagged with a purification tag using a suitable capture molecule; (3)
annealing the remaining 5'-phosphorylated strands to form first and second ligation substrates; and (4)
ligating said first and second ligation substrates after annealing strands with 5'-phosphorylated
overhangs to generate nucleic acid molecules encoding variant proteins (see Kneidinger, Graninger
and Messner, Biotechniques 30: 249-252 (2001); Au, Yang, Yang, Lo, and Kao, Biochem Biophys
Res Comm 248: 200-203 (1998)). Each of the above-cited references is herein expressly
incorporated by reference. This method is more fully described in U.S. Pat. No. 6,110,668 and
WO9815567.

In a preferred embodiment, the donor fragments are generated using modified primers and a
polymerase. The nucleic acid template may be single stranded (i.e. M13 DNA) or double stranded
(i.e., plasmid, genomic, or cDNA). The overall design of the primers will depend on the linkage
scheme between the donor fragments. For example, Figure 20 illustrates the controlled linkage
between two neighboring fragments A and B. Initially, for each gene fragment, a pair of donor
fragments is generated (DFA1/DFA2 and DFB1/DFB2). The donor fragment pairs are designed such
that the sense strand from one donor fragment, DFA1 or DFB1, complements the antisense strand of
the other donor fragment, DFA2 or DFB2, and creates a 5'-phosphorylated overhang on the hybrid
product of the corresponding two strands. The overhang is located on the side where two
neighboring gene fragments are to be joined. The sequence of the overhang is a sequence that
belongs either to the 3'-end of fragment A or the 5'-end of fragment B (in Figure 10, it belongs to B).
The strands not used to form the sticky end hybrid molecule are removed using a purification
tag.

In a preferred embodiment, the strands not used to form the sticky end are removed using
biotin/streptavidin capture technology as is known in the art. In an alternative embodiment, a
5'-phosphorylated primer is incorporated on the strand to be removed, followed by digestion of this
strand with lambda exonuclease. Subsequent 5'-phosphorylation of the remaining strand will allow
formation of a hybrid molecule with a phosphorylated overhang.

In a preferred embodiment, equimolar amounts of the corresponding single strands of the donor
fragments are combined under conditions suitable to renature double stranded molecules (A/A' and
B/B') with a 5'-phosphorylated overhang. Preferably, these double stranded molecules, also referred
to herein as ligation substrates, are joined using enzymatic or non-enzymatic ligation to form a nucleic
acid ligation product that encodes a protein variant. Alternatively, the ligation substrate is not ligated,
but instead is used as a source of donor fragments and the process is repeated.

The following U.S. patents are incorporated herein in their entirety: U.S. Patent No. 6,188,965; U.S. 
Patent No. 6,269,312; and U.S. Patent No. 6,403,312. The following U.S. patent applications are 
incorporated herein in their entirety: U.S.S.N. 09/927,790, filed August 10, 2001 and U.S.S.N.
10/101,499, filed March 18, 2002.

In a preferred embodiment, the oligonucleotides are pooled in equal proportions and multiple PCR
reactions are performed to create full length sequences containing the combinations of mutations
defined by the secondary library. In addition, this may be done using methods that introduce
additional variations, such as error-prone amplification (e.g. PCR) methods or by intentionally
introducing other variables.

In a preferred embodiment, the different oligonucleotides are added in relative amounts
corresponding to either a probability distribution table or to an arbitrary or computationally derived
formula. The multiple PCR reactions thus result in full length sequences with the desired
combinations of mutations in the desired proportions.
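
One way to picture the use of a probability distribution table is the minimal Python sketch below. It is
offered only as an illustration: the function names, the example table, and the assumption that variable
positions are incorporated independently of one another are not taken from this specification.

```python
def oligo_mixing_fractions(probability_table):
    """Normalize per-position residue weights into oligo mixing fractions.

    probability_table maps a variable position to {residue: weight}.
    """
    fractions = {}
    for position, residues in probability_table.items():
        total = sum(residues.values())
        if total <= 0:
            raise ValueError(f"position {position} has no positive weights")
        fractions[position] = {aa: w / total for aa, w in residues.items()}
    return fractions


def expected_sequence_fraction(fractions, sequence):
    """Expected share of full-length products carrying one combination.

    Assumes (hypothetically) that positions are incorporated independently.
    sequence maps a variable position to the residue of interest.
    """
    result = 1.0
    for position, aa in sequence.items():
        result *= fractions[position].get(aa, 0.0)
    return result


# Hypothetical table: position 23 favors Ala; position 57 is an even Asp/Glu mix.
table = {23: {"A": 6, "V": 3, "L": 1}, 57: {"D": 1, "E": 1}}
mix = oligo_mixing_fractions(table)
print(mix[23])
print(expected_sequence_fraction(mix, {23: "A", 57: "E"}))
```

Under these assumptions, the oligonucleotides encoding each residue at a position would be pooled in
the normalized fractions, and the product of the fractions estimates how often a given combination
appears among the full length sequences.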

The total number of oligonucleotides needed is a function of the number of positions being mutated
and the number of mutations being considered at these positions:

(number of oligos for constant positions) + M1 + M2 + M3 + ... + Mn = (total number of oligos required),
where Mn is the number of mutations considered at position n in the sequence.
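
The count above can be sketched in a few lines of Python. The function names and the example numbers
are hypothetical. The second function rests on an added assumption: when several variable positions
share one oligonucleotide and every combination is to be represented, that oligonucleotide is needed in
as many versions as the product of the mutation counts at its positions.

```python
from math import prod


def total_oligos(constant_oligos, mutations_per_position):
    """Total oligos when each variable position sits on its own oligo."""
    return constant_oligos + sum(mutations_per_position)


def total_oligos_grouped(constant_oligos, groups):
    """Total oligos when some variable positions share an oligo.

    groups is a list of lists; each inner list holds the mutation counts
    of the positions carried by one oligo (assumes every combination of
    those positions is synthesized).
    """
    return constant_oligos + sum(prod(group) for group in groups)


# Hypothetical example: 10 constant oligos; positions with 3, 2 and 4 mutations.
print(total_oligos(10, [3, 2, 4]))            # 10 + 3 + 2 + 4
print(total_oligos_grouped(10, [[3, 2], [4]]))  # 10 + 3*2 + 4
```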
In a preferred embodiment, each overlapping oligonucleotide comprises only one position to be
varied; in alternate embodiments, the variant positions are too close together to allow this and multiple
variants per oligonucleotide are used to allow complete recombination of all the possibilities. That is,
each oligo can contain the codon for a single position being mutated, or for more than one position
being mutated. The multiple positions being mutated must be close in sequence to prevent the oligo
length from being impractical. For multiple mutating positions on an oligonucleotide, particular
combinations of mutations can be included or excluded in the library by including or excluding the
oligonucleotide encoding that combination. For example, as discussed herein, there may be
correlations between variable regions; that is, when position X is a certain residue, position Y must (or
must not) be a particular residue. These sets of variable positions are sometimes referred to herein
as a "cluster". When the clusters are comprised of residues close together, and thus can reside on
one oligonucleotide primer, the clusters can be set to the "good" correlations, eliminating the bad
combinations that may decrease the effectiveness of the library. However, if the residues of the
cluster are far apart in sequence, and thus will reside on different oligonucleotides for synthesis, it may
be desirable to either set the residues to the "good" correlation, or eliminate them as variable residues
entirely. In an alternative embodiment, the library may be generated in several steps, so that the
cluster mutations only appear together. This procedure, i.e., the procedure of identifying mutation
clusters and either placing them on the same oligonucleotides or eliminating them from the library, or
library generation in several steps preserving clusters, can considerably enrich the experimental
library with properly folded protein. Identification of clusters can be carried out in a number of ways,
e.g. by using known pattern recognition methods, comparisons of frequencies of occurrence of
mutations, or by using energy analysis of the sequences to be experimentally generated (for example,
if the energy of interaction is high, the positions are correlated). These correlations may be positional
correlations (e.g. variable positions 1 and 2 always change together or never change together) or
sequence correlations (e.g. if there is a residue A at position 1, there is always residue B at position
2). See: Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications; edited by
Jason T.L. Wang, Bruce A. Shapiro, Dennis Shasha. New York: Oxford University, 1999; Andrews,
Harry C. Introduction to mathematical techniques in pattern recognition; New York, Wiley-Interscience
[1972]; Applications of Pattern Recognition; Editor, K.S. Fu. Boca Raton, Fla.: CRC Press, 1982;
Genetic Algorithms for Pattern Recognition; edited by Sankar K. Pal, Paul P. Wang. Boca Raton:
CRC Press, c1996; Pandya, Abhijit S., Pattern recognition with Neural networks in C++ / Abhijit S.
Pandya, Robert B. Macy. Boca Raton, Fla.: CRC Press, 1996; Handbook of pattern recognition and
computer vision / edited by C.H. Chen, L.F. Pau, P.S.P. Wang. 2nd ed. Singapore; River Edge, N.J.:
World Scientific, c1999; Friedman, Introduction to Pattern Recognition: Statistical, Structural,
Neural, and Fuzzy Logic Approaches; River Edge, N.J.: World Scientific, c1999. Series title: Series in
machine perception and artificial intelligence; vol. 32; all of which are expressly incorporated by
reference. In addition, programs used to search for consensus motifs can be used as well.
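
As a simple illustration of identifying sequence correlations by comparing frequencies of occurrence, the
Python sketch below scans a list of aligned sequences for rules of the form "residue a at position i is
always accompanied by residue b at position j". It is a hypothetical, deliberately naive stand-in for the
pattern recognition or energy-based methods referred to above; the function name, the min_support
threshold, and the example sequences are assumptions made for the example.

```python
from itertools import combinations


def sequence_correlations(sequences, positions, min_support=2):
    """Find rules (i, a, j, b): residue a at i always occurs with b at j.

    sequences: equal-length strings (0-based positions).
    positions: the variable positions to compare, in ascending order.
    min_support: minimum number of sequences that must show the pairing.
    """
    rules = []
    for i, j in combinations(positions, 2):
        # Ignore partners that never vary; such a rule would be trivial.
        if len({seq[j] for seq in sequences}) < 2:
            continue
        partners = {}
        for seq in sequences:
            partners.setdefault(seq[i], []).append(seq[j])
        for residue, seen in partners.items():
            if len(seen) >= min_support and len(set(seen)) == 1:
                rules.append((i, residue, j, seen[0]))
    return rules


# Hypothetical library members: positions 0 and 1 co-vary, position 2 is free.
library = ["AKL", "AKM", "GEL", "GEM", "AKL"]
print(sequence_correlations(library, positions=[0, 1, 2]))
```

Pairs of positions linked by such rules are candidates for a cluster that could be placed on a single
oligonucleotide.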

In addition, correlations and shuffling can be fixed or optimized by altering the design of the
oligonucleotides; that is, by deciding where the oligonucleotides (primers) start and stop (e.g. where
the sequences are "cut"). The start and stop sites of oligos can be set to maximize the number of
clusters that appear in single oligonucleotides, thereby enriching the library with higher scoring
sequences. Different oligonucleotide start and stop site options can be computationally modeled
and ranked according to the number of clusters that are represented on single oligos, or the percentage
of the resulting sequences consistent with the predicted library of sequences.
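
The ranking of start and stop site options by the number of clusters represented on single oligos can be
sketched as follows. This is an illustrative Python outline under simplifying assumptions: each oligo is
described only by an inclusive range of residue positions, overlap regions are ignored, and the design
names, ranges, and clusters are hypothetical.

```python
def clusters_on_single_oligos(boundaries, clusters):
    """Count clusters whose positions all fall within one oligo.

    boundaries: list of (start, end) inclusive residue ranges, one per oligo.
    clusters: list of sets of residue positions.
    """
    count = 0
    for cluster in clusters:
        if any(all(start <= p <= end for p in cluster) for start, end in boundaries):
            count += 1
    return count


def rank_designs(designs, clusters):
    """Return (score, name) pairs, best design first (ties broken by name)."""
    scored = [(clusters_on_single_oligos(b, clusters), name) for name, b in designs.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical clusters and two candidate ways of cutting a 60-residue gene.
clusters = [{12, 15}, {31, 33}, {48, 52}]
designs = {
    "A": [(1, 20), (21, 40), (41, 60)],
    "B": [(1, 13), (14, 32), (33, 60)],
}
print(rank_designs(designs, clusters))
```

Here design A keeps all three clusters on single oligos, whereas design B splits two of them across
neighboring oligos and so ranks lower.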

The total number of oligonucleotides required increases when multiple mutable positions are encoded
by a single oligonucleotide. The annealed regions are the ones that remain constant, i.e. have the
sequence of the reference sequence.

Oligonucleotides with insertions or deletions of codons can be used to create a library expressing
different length proteins. In particular, computational sequence screening for insertions or deletions
can result in secondary libraries defining different length proteins, which can be expressed by a library
of pooled oligonucleotides of different lengths.

Preferably, an individual gene that serves as the template nucleic acid is obtained from at least two 
different species. In this embodiment, the gene from one species is cloned into a vector to produce a 
template molecule comprising single stranded nucleic acid molecules. The DNA from the second
species is cleaved into fragments. The resulting fragments are added to the template molecule under 
conditions that permit the fragments to anneal to the template molecule. Unhybridized termini are 
enzymatically removed. Gaps between hybridized fragments are filled using an appropriate enzyme, 
such as a polymerase and nicks sealed using a ligase. The chimeric gene can be amplified using 
suitable primers or other techniques that are well known to those of skill in the art. 

In a preferred embodiment, sequences derived from introns are used to mediate specific cleavage 
and ligation of discontinuous nucleic acid molecules to create libraries of novel genes and gene 
products as described in U.S. Patent Nos. 5,498,531, and 5,780,272, both of which are hereby 
expressly incorporated by reference in their entirety. In one embodiment, a library of ribonucleic acids 
encoding a novel gene product or novel gene products is created by mixing splicing constructs 
comprising an exon and 3' and 5' intron fragments. See U.S. Patent No. 5,498,531. 

In another embodiment, DNA sequence libraries are created by mixing DNA/RNA hybrid molecules 
that contain intron derived sequences that are used to mediate specific cleavage and ligation of the 
DNA/RNA hybrid molecules such that the DNA sequences are covalently linked to form novel DNA 
sequences as described in U.S. Patent No. 6,150,141, WO 00/40715 and WO 00/17342, all of which 
are hereby expressly incorporated by reference in their entirety. 

In a preferred embodiment, the secondary library is made by shuffling the family (e.g. a set of
variants); that is, some set of the top sequences (if a rank-ordered list is used) can be shuffled, either
with or without error-prone PCR. "Shuffling" in this context means a recombination of related
sequences, generally in either a targeted or random way. It can include "shuffling" as defined and
exemplified in U.S. Patent Nos. 5,830,721; 5,811,238; 5,605,793; 5,837,458 and PCT US/19256, all
of which are expressly incorporated by reference in their entirety. This set of sequences can also be
an artificial set; for example, from a probability table (for example generated using SCMF) or a Monte
Carlo set. Similarly, the "family" can be the top 10 and the bottom 10 sequences, the top 100
sequences, etc. This may also be done using error-prone PCR.

Thus, in a preferred embodiment, in silico shuffling is done using the computational methods
described therein. That is, starting with either two libraries or two sequences, random recombinations
of the sequences can be generated and evaluated computationally, and then experimental libraries
generated.
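
A minimal Python sketch of in silico shuffling of two sequences is given below for illustration only. The
crossover scheme, the parameter names, and in particular the scoring function are assumptions: the
score here is a caller-supplied placeholder (lower is treated as better) and does not represent the
scoring functions of the computational methods referred to above.

```python
import random


def recombine(parent_a, parent_b, n_crossovers, rng):
    """Build one chimera by switching parents at random crossover points."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    points = sorted(rng.sample(range(1, len(parent_a)), n_crossovers))
    parents = (parent_a, parent_b)
    child, current, last = [], 0, 0
    for point in points + [len(parent_a)]:
        child.append(parents[current][last:point])
        current, last = 1 - current, point
    return "".join(child)


def in_silico_shuffle(parent_a, parent_b, score, n_children=100,
                      n_crossovers=2, keep=10, seed=0):
    """Generate random recombinants, score them, and keep the best few."""
    rng = random.Random(seed)
    children = {recombine(parent_a, parent_b, n_crossovers, rng)
                for _ in range(n_children)}
    return sorted(children, key=score)[:keep]


# Hypothetical parents and a placeholder score (mismatches to parent a).
a, b = "MKTAYIAK", "MRSGYLAR"
top = in_silico_shuffle(a, b, score=lambda s: sum(x != y for x, y in zip(s, a)))
print(top)
```

The retained sequences would then define the experimental library to be generated.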

PCR with pooled oligos

Use of pooled oligos for synthetic shuffling is more fully described in U.S. Pat. No. 6,368,861 (see
also US6423542; US6376246; US6368861; US6319714; WO0042561A3; WO0042561A2;
WO0042560A3; WO0042560A2; WO0042559A1; WO0018906C2; WO0018906A3; and
WO0018906A2).



In a preferred embodiment, PCR using a wild type gene or other gene may be used, as is
schematically depicted in Figure 15. In this embodiment, a starting gene is used: the gene may be the
wild-type gene, the gene encoding the global optimized sequence, or any other sequence of the list.
In this embodiment, oligonucleotides are used that correspond to the variant positions and contain the
different amino acids of the secondary library. PCR is done using PCR primers at the termini, as is
known in the art. PCR provides many benefits, namely: fewer oligonucleotides are needed, fewer
errors may result, and if the wild type gene is used, it need not be synthesized. Alternative methods
for creating members of the library are ligase chain reaction-based methods (see Chalmers and
Curnow, Biotechniques 30 (2001) 249-252), which is herein expressly incorporated by reference.



In a preferred embodiment, these oligonucleotides are pooled in equal proportions and multiple PCR
reactions are performed to create full-length sequences containing the combinations of mutations
defined by the secondary library. In a preferred embodiment, the different oligonucleotides are added
in relative amounts, e.g. in amounts corresponding to a probability distribution table, an alignment, or
other parameters. The multiple PCR reactions thus result in full-length sequences with the desired
combinations of mutations in the desired proportions.



Number of mutations per oligo

In a preferred embodiment, each overlapping oligonucleotide comprises one or more
positions to be varied and zero or more positions that are not varied. As may be appreciated by one
skilled in the art, the distance between multiple variants may affect the completeness of
recombination of all possible library members. That is, each oligo may contain the codon for a single
position being mutated, or for more than one position being mutated. For multiple mutating positions
on an oligonucleotide, particular combinations of mutations may be included or excluded in the library
by including or excluding the oligonucleotide encoding that combination. The total number of
oligonucleotides required increases when multiple mutable positions are encoded by a single
oligonucleotide. The annealed regions are the ones that remain constant, i.e. have the sequence of
the reference sequence.

Random codons 

In some cases, oligos with random mutations may be used. That is, any amino acid may be 
represented at a codon position. As known by those skilled in the art, subsets of random codons may 
be used, where the bias is for or against specific amino acids. By judicious design, certain amino acids
may be favored or excluded from the set of possible mutations.

Multiple DNA libraries may be synthesized that code for different subsets of amino acids at certain
positions, allowing generation of the amino acid diversity desired without having to fully randomize the
codon and thereby waste sequences in the library on stop codons, frameshifts, undesired amino
acids, etc. This may be done by creating a library that at each position to be randomized is only
randomized at one or two of the positions of the triplet, where the position(s) left constant are those
that the amino acids to be considered at this position have in common. Multiple DNA libraries may be
created to ensure that all amino acids desired at each position exist in the aggregate library.
Alternatively, shuffling, as is generally known in the art, may be done with multiple libraries.
Alternatively, the random peptide libraries may be made using the frequency tabulation and
experimental generation methods including multiplexed PCR, shuffling, and the like.
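
The idea of randomizing only part of the triplet can be illustrated with the Python sketch below, which
assumes the standard genetic code. The function name and the exhaustive search are choices made for
this example and are not taken from this specification. For a desired set of amino acids, the sketch
looks for the smallest partially randomized codon that encodes exactly that set, with no stop codons
and no undesired residues, and returns nothing when no single codon design suffices.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
# Every non-empty set of bases that may be mixed at one triplet position.
BASE_SETS = ["".join(c) for r in range(1, 5) for c in combinations(BASES, r)]


def best_degenerate_codon(wanted):
    """Smallest partially randomized codon encoding exactly `wanted`.

    Returns (number_of_codons, (bases_pos1, bases_pos2, bases_pos3)),
    or None if no single design encodes the set without extras.
    """
    wanted = set(wanted)
    best = None
    for p1, p2, p3 in product(BASE_SETS, repeat=3):
        encoded = {CODON_TABLE[a + b + c] for a in p1 for b in p2 for c in p3}
        if encoded != wanted:
            continue
        size = len(p1) * len(p2) * len(p3)
        if best is None or size < best[0]:
            best = (size, (p1, p2, p3))
    return best


print(best_degenerate_codon("DE"))  # Asp/Glu: only the third base varies
print(best_degenerate_codon("MW"))  # Met/Trp: no single design exists
```

For Asp and Glu the first two bases (G and A) are held constant and only the third is mixed. Met and
Trp share too little of their codons, so any single mixed codon covering both also encodes other
residues; this is the situation in which multiple DNA libraries would be combined.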

Error-prone PCR

In a preferred embodiment, error-prone amplification methods (e.g. error-prone PCR) are used to
generate additional members of the secondary library, or the whole library. See U.S. Patent Nos.
5,605,793, 5,811,238, and 5,830,721, all of which are hereby incorporated by reference. This may be
done on the optimal sequence or on top members of the library, or some other artificial set or family.
Error-prone PCR is then performed on the optimal sequence gene in the presence of oligonucleotides
that code for the mutations at the variable residue positions of the secondary library (bias
oligonucleotides). The addition of the oligonucleotides will create a bias favoring the incorporation of
the mutations in the secondary library. Alternatively, only oligonucleotides for certain mutations may
be used to bias the library.

In addition to error-prone PCR, mutations could be introduced in specific regions using minor
modifications to several other methods, either in vitro or in vivo, including but not limited to "DNA
shuffling" (see WO 00/42561 A3; WO 01/70947 A3), exon shuffling (see US 6365377 B1; Kolkman &
Stemmer (2001) Nature Biotechnology 19, 423-428), family shuffling (see Crameri et al. (1998)
Nature 391, 288-291; US 6376246 B1), RACHITT™ (Coco et al. (2001) Nature Biotechnology 19,
354-359; WO 02/06469 A2), STEP and random priming of in vitro recombination (see Zhao et al.,
(1998) Nature Biotechnology 16, 258-261; Shao et al. (1998) Nucleic Acids Research 26, 681-683),
exonuclease mediated gene assembly (US 6352842 B1, US 6361974 B1), Gene Site Saturation
Mutagenesis™ (US 6358709 B1), Gene Reassembly™ (US 6358709 B1) and SCRATCHY (Lutz et
al. (2001), PNAS 98, 11248-11253), DNA fragmentation methods (Kikuchi et al., Gene 236, 159-167),
and single-stranded DNA shuffling (Kikuchi et al., (2000) Gene 243, 133-137). Although these methods
are intended to introduce random mutations throughout the gene, those skilled in the art will
appreciate that specific regions (those defined by computational methods such as PDA™ technology;
see WO 01/75767) of the gene could be mutated, whilst others could be left untouched, either by
isolating and combining the mutated region with the unmodified region (for example, by cassette
mutagenesis; see WO 01/75767 A2; Kim & Maas (2000) Biotechniques 28, 196-198; Lanio & Jeltsch
(1998) Biotechniques 25, 958-965; Ge & Rudolph (1997) Biotechniques 22, 28-30; Ho et al., (1989)
Gene 77, 51-59), or via in vitro or in vivo recombination (see for example WO 02/10183 A1 and
Abecassis et al., (2000) Nucleic Acids Research 28, e88). All of the above-cited
references are hereby expressly incorporated by reference. In addition, it should be noted that the
computational equivalents of all of these methods can be used as a computational step to generate
primary and/or secondary libraries. That is, a primary library rank-ordered list produced by "in silico"
shuffling may be further "shuffled" using experimental procedures.



Additional methods for gene construction 

The creation of members of the secondary library may be performed by several other methods,
including, but not limited to, classical site-directed mutagenesis, e.g. Quickchange, commercially
available from Stratagene, cassette mutagenesis, as well as other amplification techniques. Cassette
mutagenesis could include the creation of DNA molecules from restriction digestion fragments using
nucleic acid ligation, and includes the random ligation of restriction fragments (see Kikuchi et al.,
(1999), Gene 236, 159-167). Additionally, cassette mutagenesis could also be achieved using
randomly-cleaved nucleic acids (see Kikuchi et al., (1999), Gene 236, 133-137), by PCR-ligation PCR
mutagenesis (see for example Ali & Steinkasserer (1995), Biotechniques 18, 746-750), by seamless
gene engineering using RNA- and DNA-overhang cloning (see Roc & Doc; Coljee et al., (2000)
Nature Biotechnology 18, 789-791), by ligation mediated gene construction (U.S.S.N. 60/311,545), by
homologous or non-homologous random recombination (see US6,368,861; US6423542; US6376246;
US6368861; US6319714; WO0042561A3; WO0042561A2; WO0042560A3; WO0042560A2;
WO0042559A1; WO0018906C2; WO0018906A3; and WO0018906A2), or in vivo using
recombination between flanking sequences (see WO 02/10183 A1 and Abecassis et al., (2000)
Nucleic Acids Research 28, e88 for examples). In addition, regions of the gene could be mutated in E.
coli lacking correct mismatch repair mechanisms (e.g. E. coli XLmutS strain commercially available
from Stratagene), or by using phage display techniques to evolve a library (e.g. Long-McGie et al.,
(2000), Biotechnol Bioeng 68, 121-125).

In addition to the PCR methods outlined herein, there are other amplification and gene synthesis
methods that can be used. For example, the library genes may be "stitched" together using pools of
oligonucleotides with polymerases (and optionally, or solely, ligases). These resulting variable
sequences can then be amplified using any number of amplification techniques, including, but not
limited to, polymerase chain reaction (PCR), strand displacement amplification (SDA), nucleic acid
sequence based amplification (NASBA), ligation chain reaction (LCR) and transcription mediated
amplification (TMA). In addition, there are a number of variations of PCR which may also find use in
the invention, including "quantitative competitive PCR" or "QC-PCR", "arbitrarily primed PCR" or
"AP-PCR", "immuno-PCR", "Alu-PCR", "PCR single strand conformational polymorphism" or
"PCR-SSCP", "reverse transcriptase PCR" or "RT-PCR", "biotin capture PCR", "vectorette PCR",
"panhandle PCR", and "PCR select cDNA subtraction", among others. Furthermore, by incorporating
the T7 polymerase initiator into one or more oligonucleotides, IVT amplification can be done.

Experimental Modification of Libraries to Generate Further Libraries 

It will be appreciated by those skilled in the art that many of the methods used to construct the
secondary libraries can be used in further modifications. For example, cassette mutagenesis could
include the creation of DNA molecules from restriction digestion fragments using nucleic acid ligation,
and includes the random ligation of restriction fragments (see Kikuchi et al., (1999), Gene 236,
159-167). Additionally, cassette mutagenesis could also be achieved using randomly-cleaved nucleic
acids (see Kikuchi et al., (1999), Gene 236, 133-137), by PCR-ligation PCR mutagenesis (see Ali &
Steinkasserer (1995), Biotechniques 18, 746-750), by seamless gene engineering using RNA- and
DNA-overhang cloning (Roc & Doc; Coljee et al., (2000) Nature Biotechnology 18, 789-791), by
ligation mediated gene construction (U.S.S.N. 60/311,545), or by homologous or non-homologous
random recombination (see US6,368,861; US6423542; US6376246; US6368861; US6319714;
WO0042561A3; WO0042561A2; WO0042560A3; WO0042560A2; WO0042559A1; WO0018906C2;
WO0018906A3; and WO0018906A2).

Tertiary libraries could be created from secondary libraries using any of the techniques outlined herein
or one or more of the following, either in a step-wise fashion or in combination: DNA shuffling (see
WO 00/42561 A3; WO 01/70947 A3), exon shuffling (see US 6365377 B1; Kolkman & Stemmer
(2001) Nature Biotechnology 19, 423-428), family shuffling (see Crameri et al. (1998) Nature 391,
288-291; US 6376246 B1), RACHITT™ (see Coco et al. (2001) Nature Biotechnology 19, 354-359;
WO 02/06469 A2), STEP and random priming of in vitro recombination (see Zhao et al., (1998)
Nature Biotechnology 16, 258-261; Shao et al. (1998) Nucleic Acids Research 26, 681-683),
exonuclease mediated gene assembly (see US 6352842 B1, US 6361974 B1), Gene Site Saturation
Mutagenesis™ (see US 6358709 B1), Gene Reassembly™ (see US 6358709 B1) and SCRATCHY
(see Lutz et al. (2001), PNAS 98, 11248-11253), DNA fragmentation methods (see Kikuchi et al.,
Gene 236, 159-167), single-stranded DNA shuffling (see Kikuchi et al., (2000) Gene 243, 133-137), and
in vitro or in vivo recombination (see WO 02/10183 A1 and Abecassis et al., (2000) Nucleic Acids
Research 28, e88 for examples). Additionally, in vivo mutagenesis could be performed in strains of
E. coli that lack correct DNA mismatch repair mechanisms, e.g. E. coli XLmutS strain commercially
available from Stratagene, or by using phage display techniques to evolve a library (e.g. Long-McGie
et al., (2000), Biotechnol Bioeng 68, 121-125).



Preferred Combinations 

In general, as more fully outlined below, the invention can take on a wide variety of configurations. In
general, primary libraries, e.g. libraries of all or a subset of possible proteins, are generated
computationally. This can be done in a wide variety of ways, including sequence alignments of
related proteins, structural alignments, structural prediction models, databases, or (preferably) protein
design automation computational analysis. Similarly, primary libraries can be generated via
sequence screening using a set of scaffold structures that are created by perturbing the starting
structure (using any number of techniques such as molecular dynamics or Monte Carlo analysis) to
make changes to the protein (including backbone and sidechain torsion angle changes). Optimal
sequences (or some set of the top sequences) can be selected for each starting structure to make
primary libraries.

Some of these techniques result in the list of sequences in the primary library being "scored", or
"ranked" on the basis of some particular criteria. In some embodiments, lists of sequences that are
generated without ranking can then be ranked using techniques as outlined below.

In a preferred embodiment, some subset of the primary library is then experimentally generated to
form a secondary library. Alternatively, some or all of the primary library members are recombined to
form a secondary library, e.g. with new members. Again, this may be done either computationally or
experimentally or both.

Alternatively, once the primary library is generated, it can be manipulated in a variety of ways. In one
embodiment, a different type of computational analysis can be done; for example, a new type of
ranking may be done. Alternatively, the primary library can be recombined, e.g. residues at
different positions mixed to form a new, secondary library. Again, this can be done either
computationally or experimentally, or both.

As will be appreciated by those in the art, there are a number of specific combinations that can be
used with the methods of the present invention. Examples of some preferred combinations are
shown in Figures 21 A-E.

Expression Systems 

The library proteins of the present invention are produced by culturing a host cell transformed with
nucleic acid, preferably an expression vector, containing nucleic acid encoding a library protein, under
the appropriate conditions to induce or cause expression of the library protein. The conditions
appropriate for library protein expression will vary with the choice of the expression vector and the
host cell, and will be easily ascertained by one skilled in the art through routine experimentation. For
example, the use of constitutive promoters in the expression vector will require optimizing the growth 
and proliferation of the host cell, while the use of an inducible promoter requires the appropriate 
growth conditions for induction. In addition, in some embodiments, the timing of the harvest is 
important. For example, the baculoviral systems used in insect cell expression are lytic viruses, and 
thus harvest time selection can be crucial for product yield. 

Examples of expression systems 

As will be appreciated by those in the art, the type of cells used in the present invention can vary
widely. The lists that follow are applicable both to the source of scaffold proteins as well as to host
cells in which to produce the variant libraries. A wide variety of appropriate host cells can be used,
including yeast, bacteria, archaebacteria, fungi, and insect, plant and animal cells, including
mammalian cells. Of particular interest are Drosophila melanogaster cells, Saccharomyces cerevisiae
and other yeasts, E. coli, Bacillus subtilis, Streptococcus cremoris, Streptococcus lividans, pET
(commercially available from Novagen), pBAD and pcDNA (commercially available from Invitrogen),
pGEX (commercially available from Amersham Biosciences), pQE (commercially available from
Qiagen), SF9 cells, C129 cells, 293 cells, Neurospora, BHK, CHO, COS, and HeLa cells, fibroblasts,
Schwannoma cell lines, immortalized mammalian myeloid and lymphoid cell lines, Jurkat cells, mast
cells and other endocrine and exocrine cells, and neuronal cells. See the ATCC cell line catalog,
hereby expressly incorporated by reference. In one embodiment, the cells may be genetically
engineered, that is, contain exogenous nucleic acid, for example, to contain target molecules.

In a preferred embodiment, the library proteins are expressed in mammalian expression systems,
including systems in which the expression constructs are introduced into the mammalian cells using
virus such as retrovirus or adenovirus. Any mammalian cells may be used, with mouse, rat, primate
and human cells being particularly preferred, although as will be appreciated by those in the art,
modifications of the system by pseudotyping allow all eukaryotic cells to be used, preferably higher
eukaryotes. Accordingly, suitable mammalian cell types include, but are not limited to, tumor cells of
all types (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon,
kidney, prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes
(T-cells and B cells), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including
mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and
myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts,
chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells,
and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat
T cells, NIH3T3 cells, CHO, Cos, etc. Again, scaffold proteins may be obtained from these sources
as well.



In a preferred embodiment, library proteins are expressed in bacterial systems, including bacteria in 
which the expression constructs are introduced into the bacteria using phage. Bacterial expression 
systems are well known in the art, and include Bacillus subtilis, E. coli, Streptococcus cremoris, and 
Streptococcus lividans. 

In an alternate embodiment, library proteins are produced in insect cells, including but not limited to 
Drosophila melanogaster S2 cells, as well as cells derived from members of the order Lepidoptera, 
which includes all butterflies and moths, such as the silkmoth Bombyx mori and the alfalfa looper 
Autographa californica. Lepidopteran insects are host organisms for some members of a family of 
virus, known as baculoviruses (more than 400 known species), that infect a variety of arthropods 
(see U.S. 6,090,584). 

In an alternate embodiment, library proteins are produced in insect cells. The library can be 
transfected into SF9 Spodoptera frugiperda insect cells to generate baculovirus, which is used to 
infect SF21 or High Five insect cells (commercially available from Invitrogen) for high level protein 
production. Also, transfection into Drosophila Schneider S2 cells can be used to express proteins. 

In a preferred embodiment, library protein is produced in yeast cells. Yeast expression systems are 
well known in the art, and include expression vectors for Saccharomyces cerevisiae, Candida 
albicans and C. maltosa, Hansenula polymorpha, Kluyveromyces fragilis and K. lactis, Pichia 
guilliermondii and P. pastoris, Schizosaccharomyces pombe, and Yarrowia lipolytica. 

In one embodiment the library proteins are expressed in vitro using cell-free translation systems. 
Several commercial sources are available for this, including but not limited to the Roche Rapid Translation 
System, Promega TnT system, Novagen's EcoPro system, and Ambion's ProteinScript-PRO system. In 
vitro translation systems derived from both prokaryotic (e.g. E. coli) and eukaryotic (e.g. wheat germ, 
rabbit reticulocytes) cells are available and can be chosen based on the expression levels and 
functional properties of the protein of interest. Both linear (as derived from a PCR amplification) and 
circular (as in plasmid) DNA molecules are suitable for such expression as long as they contain the 
gene encoding the protein operably linked to an appropriate promoter. Other features of the molecule 
that are important for optimal expression in either the bacterial or eukaryotic cells (including the 
ribosome binding site etc.) are also included in these constructs. The proteins can again be 
expressed individually or in suitable size pools consisting of multiple library members. The main 
advantage offered by these in vitro systems is their speed and ability to produce soluble proteins. In 
addition, the protein being synthesized can be selectively labeled if needed for subsequent functional 
analysis. 

Transformation and transfection methods 

The methods of introducing exogenous nucleic acid into host cells are well known in the art, and will 
vary with the host cell used. Techniques include dextran-mediated transfection, calcium phosphate 
precipitation, calcium chloride treatment, polybrene mediated transfection, protoplast fusion, 
electroporation, viral or phage infection, encapsulation of the polynucleotide(s) in liposomes, and 
direct microinjection of the DNA into nuclei. In the case of mammalian cells, transfection may be 
either transient or stable. 

Expression Vectors 

A variety of expression vectors may be utilized to express the library proteins. The expression 
vectors are constructed to be compatible with the host cell type. Expression vectors may comprise 
self-replicating extrachromosomal vectors or vectors which integrate into a host genome. Expression 
vectors typically comprise a library member, any fusion constructs, control or regulatory sequences, 
selectable markers, and/or additional elements. 

Preferred bacterial expression vectors include but are not limited to pET, pBAD, bluescript, pUC, 
pQE, pGEX, pMAL, and the like. 

Preferred yeast expression vectors include pPICZ, pPIC3.5K, and pHIL-S1 commercially available 
from Invitrogen. 

Expression vectors for the transformation of insect cells, and in particular, baculovirus-based 
expression vectors, are well known in the art and are described e.g., in O'Reilly et al., Baculovirus 
Expression Vectors: A Laboratory Manual (New York: Oxford University Press, 1994). 

A preferred mammalian expression vector system is a retroviral vector system such as is generally 
described in Mann et al., Cell, 33:153-9 (1993); Pear et al., Proc. Natl. Acad. Sci. U.S.A., 
90(18):8392-6 (1993); Kitamura et al., Proc. Natl. Acad. Sci. U.S.A., 92:9146-50 (1995); Kinsella et 
al., Human Gene Therapy, 7:1405-13; Hofmann et al., Proc. Natl. Acad. Sci. U.S.A., 93:5186-90; 
Choate et al., Human Gene Therapy, 7:2247 (1996); PCT/US97/01019 and PCT/US97/01048, and 
references cited therein, all of which are hereby expressly incorporated by reference. 

Inclusion of control or regulatory sequences 

Generally, expression vectors include transcriptional and translational regulatory nucleic acid 
sequences which are operably linked to the nucleic acid sequence encoding the library protein. 

The transcriptional and translational regulatory nucleic acid sequences will generally be appropriate to 
the host cell used to express the library protein, as will be appreciated by those in the art. For 
example, transcriptional and translational regulatory sequences from E. coli are preferably used to 
express proteins in E. coli. 

Transcriptional and translational regulatory sequences may include, but are not limited to, promoter 
sequences, ribosomal binding sites, transcriptional start and stop sequences, translational start and 
stop sequences, and enhancer or activator sequences. In a preferred embodiment the regulatory 
sequences comprise a promoter and transcriptional and translational start and stop sequences. 

A suitable promoter is any nucleic acid sequence capable of binding RNA polymerase and initiating 
the downstream (3') transcription of the coding sequence of library protein into mRNA. Promoter 
sequences may be constitutive or inducible. The promoters may be naturally occurring promoters, 
hybrid or synthetic promoters. 

A suitable bacterial promoter has a transcription initiation region which is usually placed proximal to 
the 5' end of the coding sequence. The transcription initiation region typically includes an RNA 
polymerase binding site and a transcription initiation site. In E. coli, the ribosome-binding site is called 
the Shine-Dalgarno (SD) sequence and includes an initiation codon and a sequence 3-9 nucleotides 
in length located 3-11 nucleotides upstream of the initiation codon. Promoter sequences for 
metabolic pathway enzymes are commonly utilized. Examples include promoter sequences derived 
from sugar metabolizing enzymes, such as galactose, lactose and maltose, and sequences derived 
from biosynthetic enzymes such as tryptophan. Promoters from bacteriophage, such as the T7 
promoter, may also be used. In addition, synthetic promoters and hybrid promoters are also useful; 
for example, the tac promoter is a hybrid of the trp and lac promoter sequences. 
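
The Shine-Dalgarno geometry described above (a motif 3-9 nucleotides long, located 3-11 nucleotides upstream of the initiation codon) can be illustrated with a simple sequence scan. The following is only a minimal sketch: the consensus string AGGAGG, the function name find_sd_sites, and the reading of "upstream" as the number of nucleotides between the end of the motif and the ATG are assumptions made for illustration and are not taken from this specification.

```python
# Canonical E. coli Shine-Dalgarno consensus (an assumption; not given in the text).
SD_CONSENSUS = "AGGAGG"


def find_sd_sites(seq, min_len=3, max_len=9, min_gap=3, max_gap=11):
    """Return (motif_start, motif, gap, atg_index) for each ATG with an SD-like motif.

    gap is the number of nucleotides between the end of the motif and the ATG
    (one possible reading of "located 3-11 nucleotides upstream").
    """
    seq = seq.upper().replace("U", "T")
    # The motif cannot be longer than the consensus it is matched against.
    longest = min(max_len, len(SD_CONSENSUS))
    hits = []
    for atg in range(len(seq) - 2):
        if seq[atg:atg + 3] != "ATG":
            continue
        best = None
        for gap in range(min_gap, max_gap + 1):
            end = atg - gap  # exclusive end index of the candidate motif
            for length in range(longest, min_len - 1, -1):
                start = end - length
                if start < 0:
                    continue
                candidate = seq[start:end]
                # Accept any contiguous piece of the consensus; keep the longest one.
                if candidate in SD_CONSENSUS and (best is None or length > len(best[1])):
                    best = (start, candidate, gap, atg)
        if best is not None:
            hits.append(best)
    return hits


if __name__ == "__main__":
    # Hypothetical test sequence: motif, six-nucleotide spacer, then ATG.
    print(find_sd_sites("TTTAGGAGGTTTTTTATGAAA"))
```

A scan of this kind only flags candidate ribosome-binding sites; it does not predict expression level, which depends on the other regulatory elements discussed in this section.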

Preferred yeast promoter sequences include the inducible GAL1,10 promoter, the promoters from 
alcohol dehydrogenase, enolase, glucokinase, glucose-6-phosphate isomerase, 
glyceraldehyde-3-phosphate-dehydrogenase, hexokinase, phosphofructokinase, 3-phosphoglycerate mutase, pyruvate 
kinase, and the acid phosphatase gene. 

A suitable mammalian promoter will have a transcription initiating region, which is usually placed 
proximal to the 5' end of the coding sequence, and a TATA box, usually located 25-30 base pairs 
upstream of the transcription initiation site. The TATA box is thought to direct RNA polymerase II to 
begin RNA synthesis at the correct site. A mammalian promoter will also contain an upstream 
promoter element (enhancer element), typically located within 100 to 200 base pairs upstream of the 
TATA box. Typically, transcription termination and polyadenylation sequences recognized by 
mammalian cells are regulatory regions located 3' to the translation stop codon and thus, together 
with the promoter elements, flank the coding sequence. The 3' terminus of the mature mRNA is 
formed by site-specific post-transcriptional cleavage and polyadenylation. Examples of transcription 
terminator and polyadenylation signals include those derived from SV40. An upstream promoter 
element determines the rate at which transcription is initiated and can act in either orientation. Of 
particular use as mammalian promoters are the promoters from mammalian viral genes, since the 
viral genes are often highly expressed and have a broad host range. Examples include the SV40 
early promoter, mouse mammary tumor virus LTR promoter, adenovirus major late promoter, herpes 
simplex virus promoter, and the CMV promoter. 

Inclusion of a selectable marker 

In addition, in a preferred embodiment, the expression vector contains a selection gene or marker to 
allow the selection of transformed host cells containing the expression vector. Selection genes are 
well known in the art and will vary with the host cell used. 

For example, a bacterial expression vector may include a selectable marker gene to allow for the 
selection of bacterial strains that have been transformed. Suitable selection genes include genes 
which render the bacteria resistant to drugs such as ampicillin, chloramphenicol, erythromycin, 
kanamycin, neomycin and tetracycline. 

Yeast selectable markers include the biosynthetic genes ADE2, HIS4, LEU2, and TRP1 when used in 
the context of auxotrophic strains; ALG7, which confers resistance to tunicamycin; the neomycin 
phosphotransferase gene, which confers resistance to G418; and the CUP1 gene, which allows 
yeast to grow in the presence of copper ions. 

Suitable mammalian selection markers include, but are not limited to, those that confer resistance to 
neomycin (or its analog G418), blasticidin S, histidinol D, bleomycin, puromycin, hygromycin B, and 
other drugs. Selectable markers conferring survivability in a specific media include, but are not limited 
to, Blasticidin S Deaminase, Neomycin phosphotransferase II, Hygromycin B phosphotransferase, 
Puromycin N-acetyl transferase, Bleomycin resistance protein (or Zeocin resistance protein, 
Phleomycin resistance protein, or phleomycin/zeocin binding protein), hypoxanthine guanine 
phosphoribosyl transferase (HPRT), Thymidylate synthase, xanthine-guanine phosphoribosyl 
transferase, and the like. 

Inclusion of additional elements 

In addition, the expression vector may comprise additional elements. In a preferred embodiment, the 
vector contains a fusion protein, as discussed below. In another embodiment, the expression vector 
may have two replication systems, thus allowing it to be maintained in two organisms, for example in 
mammalian or insect cells for expression and in a prokaryotic host for cloning and amplification. 
Furthermore, for integrating expression vectors, the expression vector contains at least one sequence 
homologous to the host cell genome, and preferably two homologous sequences which flank the 
expression construct. The integrating vector may be directed to a specific locus in the host cell by 
selecting the appropriate homologous sequence for inclusion in the vector. Such vectors may include 
cre-lox recombination sites, or attR, attB, attP, and attL sites. Constructs for integrating vectors and 
appropriate selection and screening protocols are well known in the art and are described in e.g., 
Mansour et al., Cell, 51:503 (1988) and Murray, Gene Transfer and Expression Protocols, Methods in 
Molecular Biology, Vol. 7 (Clifton: Humana Press, 1991). In a preferred embodiment, the expression 
vector contains an RNA splicing sequence upstream or downstream of the gene to be expressed in 
order to increase the level of gene expression. (See Barret et al., Nucleic Acids Res. 1991; Groos et 
al., Mol. Cell. Biol. 1987; and Budiman et al., Mol. Cell. Biol. 1988.) 

Fusion Constructs 

The library protein may also be made as a fusion protein, using techniques well known in the art. For 
example, fusion partners such as targeting sequences can be used which allow the localization of the 
library members into a subcellular or extracellular compartment of the cell. Purification tags may be 
fused with a library, allowing the purification or isolation of the library protein. Rescue sequences can 
be used to enable the recovery of the nucleic acids encoding them. Other fusion sequences are 
possible, such as fusions which enable utilization of a screening or selection technology. 

Targeting or signal sequences 

The expression vector may also include a signal peptide sequence that directs library protein and any 
associated fusions to a desired cellular location or to the extracellular media. Suitable targeting 
sequences include, but are not limited to, binding sequences capable of causing binding of the 
expression product to a predetermined molecule or class of molecules while retaining bioactivity of 
the expression product (for example by using enzyme inhibitor or substrate sequences to target a 
class of relevant enzymes); sequences signalling selective degradation, of itself or co-bound proteins; 
and signal sequences capable of constitutively localizing the candidate expression products to a 
predetermined cellular locale, including a) subcellular locations such as the Golgi, endoplasmic 
reticulum, nucleus, nucleoli, nuclear membrane, mitochondria, chloroplast, secretory vesicles, 
lysosome, and cellular membrane; and b) extracellular locations via a secretory signal. Target 
sequences also may be used in conjunction with cell surface display technology as discussed below. 
Particularly preferred is localization to either subcellular locations or to the outside of the cell via 
secretion. For example some targeting sequences enable secretion of library protein in bacteria. The 
signal sequence typically encodes a signal peptide comprised of hydrophobic amino acids which 
direct the secretion of the protein from the cell, as is well known in the art. This method may be useful 
for gram-positive bacteria or gram-negative bacteria. The protein can be either secreted into the 
growth media or into the periplasmic space, located between the inner and outer membrane of the 
cell. 

Purification tags 

In a preferred embodiment, the library member comprises a purification tag operably linked to the rest 
of the library peptide or protein. A purification tag is a sequence which may be used to purify or 
isolate the candidate agent, for detection, for immunoprecipitation, for FACS (fluorescence-activated 
cell sorting), or for other reasons. Thus, for example, purification tags include purification sequences 
such as polyhistidine, including but not limited to His6, or other tag for use with Immobilized Metal 
Affinity Chromatography (IMAC) systems (e.g. Ni2+ affinity columns), GST fusions, MBP fusions, 
Strep-tag, the BSP biotinylation target sequence of the bacterial enzyme BirA, and epitope tags which 
are targeted by antibodies. Suitable epitope tags include but are not limited to c-myc (for use with the 
commercially available 9E10 antibody), flag tag, and the like. 

Rescue fusions 

A rescue fusion is a fusion protein which enables recovery of the nucleic acid encoding the library 
protein. In a preferred embodiment, such a rescue fusion would enable screening or selection of 
library members. Such fusion proteins may include but are not limited to, rep proteins, viral VPg 
proteins, transcription factors including but not limited to zinc fingers, RNA and DNA binding proteins, 
and the like. Attachment can be covalent or noncovalent. 

Alternatively, the rescue sequence may be a unique oligonucleotide sequence that serves as a probe 
target site to allow the quick and easy isolation of the retroviral construct, via PCR, related 
techniques, or hybridization. 

In an alternate embodiment, rescue sequences could also be based upon in vivo recombination 
systems, such as the cre-lox system, the Invitrogen Gateway system, forced recombination systems 
in yeast, mammalian, plant, bacteria or fungal cells (see WO 02/10183 A1), or phage display systems. 

In an alternate embodiment, display technologies are utilized. For example, in phage display (see 
Kay, BK et al, eds. Phage display of peptides and proteins: a laboratory manual (Academic Press, 
San Diego, CA, 1996); Lowman HB, Bass SH, Simpson N, Wells JA (1991) Selecting high-affinity 
binding proteins by monovalent phage display. Biochemistry 30:10832-10838; Smith GP (1985) 
Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion 
surface. Science 228:1315-1317), library proteins can be fused to the gene III protein. Cell surface 
display (Wittrup KD, Protein engineering by cell-surface display. Curr. Opin. Biotechnology 2001, 
12:395-399) may also be useful for screening. This includes but is not limited to display on bacteria 
(see Georgiou G, Poetschke HL, Stathopoulos C, Francisco JA, Practical applications of engineering 
gram-negative bacterial cell surfaces. Trends Biotechnol. 1993 Jan;11(1):6-10; Georgiou G, 
Stathopoulos C, Daugherty PS, Nayak AR, Iverson BL, and Curtiss RR (1997) Display of 
heterologous proteins on the surface of microorganisms: from the screening of combinatorial libraries 
to live recombinant vaccines. Nature Biotechnol. 15:29-34; Lee JS, Shin KS, Pan JG, Kim CJ, 
Surface-displayed viral antigens on Salmonella carrier vaccine. Nature Biotechnology 2000, 
18:645-648; Jun et al. 1998), yeast (see Boder ET, Wittrup KD: Yeast surface display for screening 
combinatorial polypeptide libraries. Nat Biotechnol 1997, 15:553-557; Boder ET and Wittrup KD, Yeast 
surface display for directed evolution of protein expression, affinity, and stability. Methods Enzymol 
2000, 328:430-44), and mammalian cells (see Whitehorn EA, Tate E, Yanofsky SD, Kochersperger 
L, Davis A, Mortensen RB, Yonkovich S, Bell K, Dower WJ, and Barrett RW 1995. A generic method 
for expression and use of "tagged" soluble versions of cell surface receptors. Bio/Technology 13: 
1215-1219). 

Additional fusions that allow for screening or selection 

In an alternate embodiment, a protein fragment complementation assay is used (see Johnsson N & 
Varshavsky A. Split ubiquitin as a sensor of protein interactions in vivo. 1994 Proc Natl Acad Sci 
USA 91:10340-10344; Pelletier JN, Campbell-Valois FX, Michnick SW. Oligomerization 
domain-directed reassembly of active dihydrofolate reductase from rationally designed fragments. 1998 Proc 
Natl Acad Sci USA 95:12141-12146). Other fusion methods which may allow screening include but 
are not limited to periplasmic expression and cytometric screening (see Chen G, Hayhurst A, Thomas 
JG, Harvey BR, Iverson BL, Georgiou G: Isolation of high-affinity ligand-binding proteins by 
periplasmic expression with cytometric screening (PECS). Nat Biotechnol 2001, 19:537-542), and 
the yeast two hybrid screen (see Fields S, Song O: A novel genetic system to detect protein-protein 
interactions. Nature 1989, 340:245-246). 

Other fusions 

Additional fusion partners may also be utilized. For example, library protein may be made as a fusion 
protein to increase expression, increase solubility, confer stability or protection from degradation, 
and/or confer other properties. For example, when raising monoclonal antibodies to a small epitope, 
the library protein may be fused to a carrier protein to form an immunogen. According to 
Varshavsky's N-End Rule, susceptibility to ubiquitination and subsequent degradation can be 
minimized by the incorporation of glycines after the initiation methionine (MG or MGG), thus 
conferring long half-life in the cytoplasm. Similarly, adding two prolines to the C-terminus confers 
resistance to carboxypeptidase action. 

Linkers 

Linker sequences may be used to connect the library protein to its fusion partner or tag. The linker 
sequence will generally comprise a small number of amino acids, typically less than ten. However, 
longer linkers may also be used. As will be appreciated by those skilled in the art, any of a wide 
variety of sequences may be used as linkers. Typically, linker sequences are selected to be flexible 
and resistant to degradation. A common linker sequence comprises the amino acid sequence 
GGGGS. The preferred linker between a protein and C-terminal PP tag consists of two glycines. 
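
The sequence-level modifications described above (glycines after the initiation methionine, a two-glycine linker, and a C-terminal PP tag) can be sketched as simple string assembly. This is a minimal illustration under stated assumptions: the function name build_construct, its parameters, and the example library sequence are hypothetical and do not appear in this specification.

```python
# One-letter codes for the 20 standard amino acids.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

GG_LINKER = "GG"  # two-glycine linker between the protein and the C-terminal PP tag
PP_TAG = "PP"     # two C-terminal prolines (carboxypeptidase resistance)


def build_construct(library_protein, n_glycines=2, add_pp_tag=True):
    """Assemble M + G/GG + library protein (+ GG linker + PP tag).

    library_protein is a one-letter amino acid string; a leading Met, if
    present, is replaced by the MG or MGG start described in the text.
    """
    seq = library_protein.upper()
    if not seq or set(seq) - VALID_RESIDUES:
        raise ValueError("library_protein must be a non-empty one-letter amino acid string")
    if n_glycines not in (1, 2):
        raise ValueError("n_glycines must be 1 (MG) or 2 (MGG)")
    core = seq[1:] if seq.startswith("M") else seq
    construct = "M" + "G" * n_glycines + core
    if add_pp_tag:
        construct += GG_LINKER + PP_TAG
    return construct


if __name__ == "__main__":
    # Hypothetical library member used only to show the layout.
    print(build_construct("MKTAYIAK"))
```

Other fusion partners, such as a GGGGS linker followed by a purification tag, could be appended in the same way; the sketch keeps only the elements named in the two preceding sections.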

Labels 

In one embodiment, the library nucleic acids, proteins and antibodies of the invention are labeled. In 
general, labels fall into three classes: a) immune labels, which may be an epitope incorporated as a 
fusion construct which is recognized by an antibody as discussed above; b) isotopic labels, which 
may be radioactive or heavy isotopes; and c) small molecule labels, which may include fluorescent 
and colorimetric dyes or molecules such as biotin which enable the use of other labeling techniques. 
Labels may be incorporated into the compound at any position and may be incorporated in vivo during 
protein or peptide expression or in vitro. 

Protein Purification 

In a preferred embodiment, the library protein is purified or isolated after expression. Library proteins 
may be isolated or purified in a variety of ways known to those skilled in the art depending on what 
other components are present in the sample. The degree of purification necessary will vary 
depending on the use of the library protein. In some instances no purification will be necessary. For 
example in one embodiment, if library proteins are secreted, screening or selection can take place 
directly from the media. 

Standard purification methods include electrophoretic, molecular, immunological and chromatographic 
techniques, including ion exchange, hydrophobic, affinity, size exclusion chromatography, and 
reversed-phase HPLC chromatography, as well as precipitation, dialysis, and chromatofocusing 
techniques. Purification can often be facilitated by the inclusion of a purification tag, as described 
above. For example, the library protein may be purified using glutathione resin if a GST fusion is 
employed, Immobilized Metal Affinity Chromatography (IMAC) if a His or other tag is employed, or 
immobilized anti-flag antibody if a flag tag is used. Ultrafiltration and diafiltration techniques, in 
conjunction with protein concentration, are also useful. For general guidance in suitable purification 
techniques, see Scopes, R., Protein Purification: Principles and Practice, 3rd Ed., Springer-Verlag, NY 
(1994), hereby expressly incorporated by reference. 

In a preferred embodiment, the libraries are used in any number of display techniques. For example, 
the libraries may be displayed using phage or enveloped virus systems, bacterial systems, yeast two 
hybrid systems or mammalian systems. 

In a preferred embodiment, the libraries are displayed using a phage or enveloped virus system. For 
example, a library of viruses, each carrying a distinct peptide sequence as part of the coat protein, 
can be produced by inserting random oligonucleotide sequences into the coding sequence of viral 
coat or envelope proteins. Several different viral systems have been used to display peptides, as 
described in Smith, G.P. (1985) Science, 228:1315-1317; Santini, C., et al., (1998) J. Mol. Biol., 
282:125-135; Sternberg, N. and Hoess, R.H. (1995) Proc. Natl. Acad. Sci. USA, 92:1609-1613; 
Maruyama, I.N., et al. (1994) Proc. Natl. Acad. Sci. USA, 91:8273-8277; Dunn, I.S. (1995) J. Mol. 
Biol., 248:497-506; Rosenberg, A., et al. (1996) Innovations 6:1-6; Ren, Z.J., et al. (1996) Protein 
Sci., 5:1833-1843; Efimov, V.P., et al. (1995) Virus Genes 10:173-177; Dulbecco, R., U.S. Patent No. 
4,593,002; Ladner, R.C., et al., U.S. Patent No. 5,837,500; Ladner, R.C., et al., U.S. Patent No. 
5,223,409; Dower, et al., U.S. Patent No. 5,427,908; Russell et al., U.S. Patent No. 5,723,287; Li, U.S. 
Patent No. 6,190,856; and the application entitled "METHODS AND COMPOSITIONS FOR THE 
CONSTRUCTION AND USE OF ENVELOPE VIRUSES AS DISPLAY PARTICLES", filed August 2, 
2001, serial number not yet assigned, all of which are expressly incorporated by reference. 

In a preferred embodiment, the libraries are displayed on the surface of a bacterial cell as is described 
in WO 97/37025, which is expressly incorporated by reference in its entirety. In this embodiment, 
surface anchoring vectors are provided for the surface expression of genes encoding proteins of 
interest. At a minimum, the vector includes a gene encoding an ice nucleation protein, a secretion 
signal, a targeting signal and a gene of interest. Preferably, the bacterial host is a gram negative 
bacterium belonging to the genera Escherichia, Acetobacter, Pseudomonas, Xanthomonas, Erwinia 
and Zymomonas. Advantages to using the ice nucleation protein as the surface anchoring protein are 
the high level of expression of the ice nucleation protein on the surface of the bacterial cell and its 
stable expression during the stationary phase of bacterial cell growth. 

In a preferred embodiment the libraries are displayed using yeast two hybrid systems as is described 
in Fields and Song (1989) Nature 340:245, which is expressly incorporated herein by reference. 
Yeast-based two-hybrid systems utilize chimeric genes and detect protein-protein interactions via the 
activation of reporter-gene expression. Reporter-gene expression occurs as a result of reconstitution 
of a functional transcription factor caused by the association of fusion proteins encoded by the 
chimeric genes. Preferably, the yeast two-hybrid system commercially available from Clontech is 
used to screen libraries for proteins that interact with candidate proteins. See generally, Ausubel et 
al., Current Protocols in Molecular Biology, John Wiley & Sons, pp. 13.14.1-13.14.14, which is 
expressly incorporated by reference. 

In a preferred embodiment, the libraries are displayed using mammalian systems. For example, a 
cell-based display can be used to display large cDNA libraries in mammalian cells as described in 
Nolan, et al., U.S. Patent No. 6,163,380; Shioda, et al., U.S. Patent No. 6,251,676, both of which are 
expressly incorporated herein by reference. 

Screening of Libraries 

High-throughput screening technology 

Fully robotic or microfluidic systems include automated liquid-, particle-, cell- and organism-handling, 
including high throughput pipetting to perform all steps of experimental library generation, protein 
expression, and library screening. This includes liquid, particle, cell, and organism manipulations such 
as aspiration, dispensing, mixing, diluting, washing, and accurate volumetric transfers; retrieving and 
discarding of pipette tips; and repetitive pipetting of identical volumes for multiple deliveries from a 
single sample aspiration. These manipulations are cross-contamination-free liquid, particle, cell, and 
organism transfers. This instrument performs automated replication of microplate samples to filters, 
membranes, and/or daughter plates, high-density transfers, full-plate serial dilutions, and high 
capacity operation. 
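
The arithmetic behind the full-plate serial dilutions mentioned above can be sketched as follows. This is an illustrative calculation only, assuming each well is pre-filled with a fixed diluent volume and a fixed volume is carried from one well to the next; the function names, units and example values are hypothetical and are not taken from this specification or from any particular liquid-handling instrument.

```python
def serial_dilution(start_conc, dilution_factor, n_wells):
    """Concentrations across a dilution series; the first well holds start_conc."""
    if dilution_factor <= 1:
        raise ValueError("dilution_factor must be greater than 1")
    if n_wells < 1:
        raise ValueError("n_wells must be at least 1")
    return [start_conc / dilution_factor ** i for i in range(n_wells)]


def transfer_volume(diluent_volume, dilution_factor):
    """Volume to carry into a well pre-filled with diluent_volume.

    Mixing a transferred volume T with diluent D dilutes by (T + D) / T,
    so T = D / (dilution_factor - 1).
    """
    if dilution_factor <= 1:
        raise ValueError("dilution_factor must be greater than 1")
    return diluent_volume / (dilution_factor - 1)


if __name__ == "__main__":
    # Hypothetical example: two-fold series across one 12-well row,
    # with 100 (arbitrary volume units) of diluent per well.
    print(serial_dilution(100.0, 2, 12))
    print(transfer_volume(100.0, 2))
```

In an automated setting the same two numbers (per-well concentration and carry-over volume) would be computed once per plate and passed to whatever pipetting routine the instrument exposes.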

In addition, as will also be appreciated by those in the art, biochips may be part of the HTS system 
utilizing any number of components such as biosensor chips with protein arrays to measure protein- 
protein interactions or DNA-sensor chips to measure protein-DNA interactions. Microfluidic chip 
arrays (e.g., those commercially available from Caliper) may also be utilized in the context of 
automated HTS screening. 

The automated HTS system used can include a computer workstation comprising a microprocessor 
programmed to manipulate a device selected from the group consisting of a thermocycler, a 
multichannel pipetter, a sample handler, a plate handler, a gel loading system, an automated 
transformation system, a gene sequencer, a colony picker, a bead picker, a cell sorter, an incubator, a 
light microscope, a fluorescence microscope, a spectrofluorimeter, a spectrophotometer, a 
luminometer, a CCD camera and combinations thereof. 

in vivo screening 

In a preferred embodiment, the library is screened using in vivo assay systems, including cell-based, 
tissue-based, or whole-organism assay systems. Cells, tissues, or organisms may be exposed to 
individual library members or pools containing several library members. Alternatively, host cells can 
be transformed or transfected with DNA encoding the library proteins and analyzed for phenotypic 
alterations. 

To screen the library, experimental systems are developed in which the activity for the library protein 
of interest is coupled to an observable property. Typical observable properties include changes in 
absorbance, fluorescence, or luminescence. Screens may also monitor changes in properties such 
as cell morphology or viability. 

For example, cell death or viability can be measured using dyes or immunocytochemical reagents 
(e.g. Caspase staining assay for apoptosis, Alamar blue for cell vitality) that specifically recognize 
either viable or inviable cells. 

In an alternate cell death or viability assay, the cells are transformed or transfected with a receptor or 
binding partner protein responsive to the ligand represented by the library. The receptor may be 
coupled to a signaling pathway that causes cell death, allows cell survival, or triggers expression of a 
reporter gene. These readout modalities can be measured using dyes or immunocytochemical 
reagents that indicate cell death or cell vitality (e.g. Caspase staining assay for apoptosis, Alamar blue 
for cell vitality). 

Alternatively, readout can be via a reporter construct Reporter constructs may be proteins that are 
mtnnsically fluorescent or colored, or proteins that modily the spectral properties of a substrate or 
binding partoer. Common reporter constructs include the proteins ludferase. green fluorescent 
protein, and beta-galactosidase. 

The assays described can also be performed by measuring morphological changes of the cells as a
response to the presence of a library variant. These morphological changes can be registered using
microscopic image analysis systems (e.g., Cellomics ArrayScan technology) such as those now
available commercially.

in vitro screening 

In a preferred embodiment, different physical and functional properties of the library members are
screened in an in vitro assay. Properties of library members that may be screened include, but are
not limited to, various aspects of stability (including pH, thermal, oxidative/reductive and solvent
stability), solubility, affinity, activity and specificity. Multiple properties can be screened
simultaneously (e.g., substrate specificity in organic solvents, receptor-ligand binding at low pH) or
individually.

Protein properties can be assayed and detected in a wide variety of ways. Typical readouts include,
but are not limited to, chromogenic, fluorescent, luminescent or isotopic signals. These detection
modalities are utilized in several assay methods including, but not limited to, FRET (fluorescence
resonance energy transfer) and BRET (bioluminescence resonance energy transfer) based assays,
AlphaScreen (Amplified Luminescent Proximity Homogeneous Assay), SPA (scintillation proximity
assay), ELISA (enzyme-linked immunosorbent assays), BIACORE (surface plasmon resonance), or
enzymatic assays. In vitro screening may or may not utilize a protein fusion or a label.

Selection of Libraries 

In an alternatively preferred embodiment, a selection method is used to select for desired library 
members. This is generally done on the basis of desired phenotypic properties, e.g. the protein 
properties defined herein. This is enabled by any method which couples phenotype and genotype, 
i.e. protein function with the nucleic acid that codes for it. In some cases this will be a "trans" effect
rather than a "cis" effect. In this way, isolation of library protein variants simultaneously enables
isolation of its coding nucleic acid. Once isolated, the gene or genes encoding library protein can be 
purified ("rescued") and/or amplified. This process of isolation and amplification can be repeated, 
allowing favorable protein variants in the library to be enriched. Nucleic acid sequencing of the 
selected library members ultimately allows for identification of library members with desired 
properties. 
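
The enrichment obtained by repeating isolation and amplification can be illustrated with a minimal
sketch. It assumes a simple two-population model in which desired variants are recovered a fixed
number of times more efficiently than all other variants in every round; the function name and the
enrichment_factor parameter are hypothetical and are not taken from any particular selection system.

```python
def enrich(fraction, enrichment_factor, rounds):
    """Return the fraction of desired variants before and after each round.

    fraction: starting fraction of desired variants in the library (0..1).
    enrichment_factor: assumed constant per-round ratio by which desired
        variants are recovered relative to all other variants.
    rounds: number of isolation/amplification cycles.
    """
    history = [fraction]
    for _ in range(rounds):
        desired = fraction * enrichment_factor   # relative recovery of desired variants
        undesired = 1.0 - fraction               # relative recovery of everything else
        fraction = desired / (desired + undesired)  # renormalize after amplification
        history.append(fraction)
    return history


if __name__ == "__main__":
    # One desired variant per million, 100-fold enrichment per round.
    for round_number, value in enumerate(enrich(1e-6, 100.0, 4)):
        print(round_number, f"{value:.6f}")
```

Under these assumptions a rare variant comes to dominate the pool within a few rounds, which is why
sequencing is typically deferred until after several cycles.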

Isolation of library protein can be accomplished by a number of methods. In some embodiments, only 
cells containing library protein variants with desired protein properties are allowed to survive or 
replicate. In alternate embodiments, the library protein and its genetic material are obtained by 
binding the library protein to another protein, RNA aptamer, or other molecule. 

In one embodiment, the selection method is based on the use of specific fusion constructs. For 
example, if phage display is used, the library members are fused to the phage gene III protein. 

In one embodiment, selection is accomplished using a rescue fusion sequence, which forms a
covalent or noncovalent link between the library member (phenotype) and the nucleic acid that
encodes the library member (genotype). For example, in a preferred embodiment the rescue fusion
protein binds to a specific sequence on the expression vector (see U.S.S.N. 09/642,574;
PCT/US00/22906; U.S.S.N. 10/023,208; PCT/US01/49058; U.S.S.N. 09/792,630; U.S.S.N.
10/080,376; PCT/US02/04852; U.S.S.N. 09/792,626; PCT/US02/04853; U.S.S.N. 10/082,671;
U.S.S.N. 09/953,351; PCT/US01/28702; U.S.S.N. 10/097,100; and PCT/US02/07466), and envelope
virus (see U.S.S.N. 09/922,503 and PCT/US01/24535).

In an alternate embodiment, selection is accomplished using a display technology including, but not
limited to, phage display, in which the library members are fused to a protein such as the phage gene
III protein (see Kay, BK et al., eds. Phage display of peptides and proteins: a laboratory manual
(Academic Press, San Diego, CA, 1996); Lowman HB, Bass SH, Simpson N, Wells JA (1991)
Selecting high-affinity binding proteins by monovalent phage display. Biochemistry 30:10832-10838;
Smith GP Filamentous fusion phage: novel expression vectors that display cloned antigens on the
virion surface. (1985) Science 228:1315-1317), and its derivatives such as selective phage infection
(see Malmborg AC, Soderlind E, Frost L, Borrebaeck CA Selective phage infection mediated by
epitope expression on F pilus. (1997) J Mol Biol 273:544-551), selectively infective phage (see
Krebber C, Spada S, Desplancq D, Krebber A, Ge L, Pluckthun A Selectively infective phage (SIP): a
mechanistic dissection of a novel method to select for protein-ligand interactions. (1997) J Mol Biol
268:619-630), and delayed infectivity panning (see Benhar I, Azriel R, Nahary L, Shaky S,
Berdichevsky Y, Tamarkin A, Wels W (2000) Highly efficient selection of phage antibodies mediated
by display of antigen as Lpp-OmpA' fusions on live bacteria. J Mol Biol 301:893-904). Other display
technologies which could be used include, but are not limited to, cell surface display (see Wittrup KD,
Protein engineering by cell-surface display. Curr. Opin. Biotechnology 2001, 12:395-399) such as
display on bacteria (see Georgiou G, Poetschke HL, Stathopoulos C, Francisco JA. Practical
applications of engineering gram-negative bacterial cell surfaces. Trends Biotechnol. 1993
Jan;11(1):6-10; Georgiou G, Stathopoulos C, Daugherty PS, Nayak AR, Iverson BL, and Curtiss RR
(1997) Display of heterologous proteins on the surface of microorganisms: from the screening of
combinatorial libraries to live recombinant vaccines. Nature Biotechnol. 15, 29-34; Lee JS, Shin KS,
Pan JG, Kim CJ. Surface-displayed viral antigens on Salmonella carrier vaccine. Nature
Biotechnology 2000, 18:645-648; Jun HC, Lebeault JM, Pan JG. Surface display of Zymomonas
mobilis levansucrase by using the ice-nucleation protein of Pseudomonas syringae. Nat Biotechnol
1998, 16:676-80), yeast (see Boder ET and Wittrup KD. Yeast surface display for directed evolution
of protein expression, affinity, and stability. Methods Enzymol 2000, 328:430-44; Boder ET, Wittrup
KD: Yeast surface display for screening combinatorial polypeptide libraries. Nat Biotechnol 1997,
15:553-557), and mammalian cells (see Whitehorn EA, Tate E, Yanofsky SD, Kochersperger L, Davis
A, Mortensen RB, Yonkovich S, Bell K, Dower WJ, and Barrett RW (1995). A generic method for
expression and use of "tagged" soluble versions of cell surface receptors. Bio/technology 13,
1215-1219), as well as in vitro display technologies such as polysome display (see Mattheakis LC, Bhatt
RR, Dower WJ. Proc Natl Acad Sci USA 1994, 91:9022-9026; Hanes J and Pluckthun A Proc Natl
Acad Sci USA 1997, 94:4937-4942), ribosome display (see Hanes J and Pluckthun A Proc Natl Acad
Sci USA 1997, 94:4937-4942), mRNA display (Roberts RW and Szostak JW Proc Natl Acad Sci USA
1997, 94:12297-12302; Nemoto N, Miyamoto-Sato E, Husimi Y, Yanagawa H FEBS Lett 1997,
414:405-408), and ribosome-inactivation display system (see Zhou J, Fujita S, Warashina M, Baba T,
Taira K J Am Chem Soc (2002), 124:538-543).

In an alternate embodiment, in vitro selection methods that do not rely on display technologies are
used. These methods include, but are not limited to, periplasmic expression and cytometric screening
(see Chen G, Hayhurst A, Thomas JG, Harvey BR, Iverson BL, Georgiou G: Isolation of high-affinity
ligand-binding proteins by periplasmic expression with cytometric screening (PECS). Nat Biotechnol
2001, 19:537-542), protein fragment complementation assay (see Johnsson N & Varshavsky A. Split
ubiquitin as a sensor of protein interactions in vivo. (1994) Proc Natl Acad Sci USA, 91:10340-10344)
and the yeast two hybrid screen (see Fields S, Song O: A novel genetic system to detect
protein-protein interactions. Nature 1989, 340:245-246) used in selection mode (see Visintin M, Tse
E, Axelson H, Rabbitts TH, Cattaneo A: Selection of antibodies for intracellular function using a
two-hybrid in vivo system. Proc Natl Acad Sci USA 1999, 96:11723-11728).

In an alternative embodiment, in vivo selection can occur if expression of the library protein imparts 
some growth, reproduction, or survival advantage to the cell. For example, if host cells transformed 
with a library comprising variants of an essential enzyme are grown in the presence of the 
corresponding substrate, only clones with a functional variant of the enzyme will survive.
Alternatively, an advantage may be conferred if the library member comprises a growth or survival
factor and the host cell expresses the appropriate receptor. 

Additional Characterization 

In a preferred embodiment, a library member or members isolated using some screening or selection
method are further characterized. The library member(s) may be subjected to further biological,
physical, structural, kinetic, and thermodynamic analysis. Thus, for example, a selected library
variant may be subjected to physical-chemical characterization using gel electrophoresis,
reversed-phase HPLC, SEC-HPLC, mass spectrometry (MS) including but not limited to LC-MS, LC-MS
peptide mapping and the like, ultraviolet absorbance spectroscopy, fluorescence spectroscopy,
circular dichroism spectroscopy, isothermal titration calorimetry, differential scanning calorimetry,
surface plasmon resonance, analytical ultracentrifugation, proteolysis, and cross-linking. Structural
analysis employing X-ray crystallographic techniques and nuclear magnetic resonance spectroscopy
is also useful. As is known to those skilled in the art, several of the above methods can also be
used to determine the kinetics and thermodynamics of binding and enzymatic reactions. The
biological properties of one or more library members, including pharmacokinetics and toxicity, can
also be characterized in cell, tissue, and whole organism experiments.

Expression vectors 

Using the nucleic acids of the present invention, which encode library members, a variety of
expression vectors are made. The expression vectors may be either self-replicating 
extrachromosomal vectors or vectors which integrate into a host genome. 

Nucleic acid is operably linked when it is placed into a functional relationship with another nucleic acid
sequence. For example, DNA for a presequence or secretory leader is operably linked to DNA for a
polypeptide if it is expressed as a preprotein that participates in the secretion of the polypeptide; a
promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the
sequence; or a ribosome binding site is operably linked to a coding sequence if it is positioned so as
to facilitate translation. However, enhancers do not have to be contiguous.



Inclusion of control or regulatory sequences

Generally, these expression vectors include transcriptional and translational regulatory nucleic acid
operably linked to the nucleic acid encoding the library protein.



The transcriptional and translational regulatory nucleic acid will generally be appropriate to the host
cell used to express the library protein, as will be appreciated by those in the art; for example,
transcriptional and translational regulatory nucleic acid sequences from Bacillus are preferably used
to express the library protein in Bacillus. Numerous types of appropriate expression vectors and
suitable regulatory sequences are known in the art for a variety of host cells.

In general, the transcriptional and translational regulatory sequences may include, but are not limited
to, promoter sequences, ribosomal binding sites, transcriptional start and stop sequences,
translational start and stop sequences, and enhancer or activator sequences. In a preferred
embodiment, the regulatory sequences include a promoter and transcriptional start and stop
sequences.

Promoter sequences include constitutive and inducible promoter sequences. The promoters may be
naturally occurring promoters, hybrid or synthetic promoters. Hybrid promoters, which combine
elements of more than one promoter, are also known in the art, and are useful in the present
invention.



Inclusion of a selectable marker(s)

In addition, in a preferred embodiment, the expression vector contains one or more selectable genes
or parts of selectable marker genes to allow the selection of transformed host cells containing the
expression vector, and, particularly in the case of mammalian cells, ensures the stability of the vector,
since cells which do not contain the vector will generally die. Selection genes are well known in the
art and will vary with the host cell used.

The bacterial expression vector may also include at least one selectable marker gene(s) to allow for
the selection of bacterial strains that have been transformed. Suitable selectable gene(s) or parts of
selectable marker genes include genes which render the bacteria resistant to drugs such as
ampicillin, chloramphenicol, erythromycin, kanamycin, neomycin and tetracycline. Selectable markers
also include biosynthetic genes, such as those in the histidine, tryptophan and leucine biosynthetic
pathways.

Inclusion of additional elements 

In a preferred embodiment, the expression vector contains an RNA splicing sequence upstream or
downstream of the gene to be expressed in order to increase the level of gene expression. See
Barret et al., Nucleic Acids Res. 1991; Groos et al., Mol. Cell. Biol. 1987; and Budiman et al., Mol.
Cell. Biol. 1988.

In addition, the expression vector may comprise additional elements. For example, the expression 
vector may have two replication systems, thus allowing it to be maintained in two organisms, for 
example in mammalian or insect cells for expression and in a prokaryotic host for cloning and 
amplification. Furthermore, for integrating expression vectors, the expression vector contains at least 
one sequence homologous to the host cell genome, and preferably two homologous sequences which 
flank the expression construct. The integrating vector may be directed to a specific locus in the host 
cell by selecting the appropriate homologous sequence for inclusion in the vector. Such vectors may 
include cre-lox recombination sites, or attR, attB, attP, and attL sites. Constructs for integrating
vectors and appropriate selection and screening protocols are well known in the art and are described
in, e.g., Mansour et al., Cell, 51:503 (1988) and Murray, Gene Transfer and Expression Protocols,
Methods in Molecular Biology, Vol. 7 (Clifton: Humana Press, 1991). 

Constructs 

Targeting or signal sequences 

The expression vector may also include a signal peptide sequence that provides for secretion of the 
library protein in bacteria. The signal sequence typically encodes a signal peptide comprised of 
hydrophobic amino acids which direct the secretion of the protein from the cell, as is well known in the 
art. The protein is either secreted into the growth media (gram-positive bacteria) or into the 
periplasmic space, located between the inner and outer membrane of the cell (gram-negative 
bacteria). 

Thus, suitable targeting sequences include, but are not limited to, binding sequences capable of
causing binding of the expression product to a predetermined molecule or class of molecules while
retaining bioactivity of the expression product (for example by using enzyme inhibitor or substrate
sequences to target a class of relevant enzymes); sequences signaling selective degradation, of itself
or co-bound proteins; and signal sequences capable of constitutively localizing the candidate
expression products to a predetermined cellular locale, including a) subcellular locations such as the
Golgi, endoplasmic reticulum, nucleus, nucleoli, nuclear membrane, mitochondria, chloroplast,
secretory vesicles, lysosome, and cellular membrane; and b) extracellular locations via a secretory
signal. Particularly preferred is localization to either subcellular locations or to the outside of the cell
via secretion.



ID (Purification) Tags 

In a preferred embodiment, the library member comprises a rescue sequence operably linked to the
rest of the peptide or protein. A rescue sequence is a sequence which may be used to purify or
isolate either the candidate agent or the nucleic acid encoding it. Thus, for example, peptide rescue
sequences include purification sequences such as polyhistidines, including but not limited to His6
and the like, or other tags for use with Ni2+ affinity columns, and epitope tags for detection,
immunoprecipitation or FACS (fluorescence-activated cell sorting). Suitable epitope tags include
c-myc (for use with the commercially available 9E10 antibody), the BSP biotinylation target sequence of
the bacterial enzyme BirA, flu tags, lacZ, and GST.

A rescue sequence could also be a nucleic acid sequence operably linked to an epitope in a
covalently attached protein, or a protein that specifically recognizes the nucleic acid. Such sequences
include, but are not limited to, most sequence-specific RNA and DNA binding proteins, preferably
those that recognize specific sequences or structures, and the like.

Alternatively, the rescue sequence may be a unique oligonucleotide sequence that serves as a probe
target site to allow the quick and easy isolation of the construct, via PCR, related techniques, or
hybridization.

In a preferred embodiment, rescue sequences could also be based upon in vivo recombination
systems, such as the cre-lox system, the Invitrogen Gateway™ system, forced recombination
systems in yeast, mammalian, plant, bacterial or fungal cells (for example WO 02/10183 A1), or phage
display systems.



Fusion constructs 

The library protein may also be made as a fusion protein, using techniques well known in the art.
Thus, for example, for the creation of monoclonal antibodies, if the desired epitope is small, the library
protein may be fused to a carrier protein to form an immunogen. Alternatively, the library protein may
be made as a fusion protein to increase expression, or for other reasons. For example, when the
library protein is a library peptide, the nucleic acid encoding the peptide may be linked to other nucleic
acid for expression purposes. Similarly, other fusion partners may be used, such as targeting
sequences which allow the localization of the library members into a subcellular or extracellular
compartment of the cell; rescue sequences or purification tags which allow the purification or isolation
of either the library protein or the nucleic acids encoding them; stability sequences, which confer
stability or protection from degradation to the library protein or the nucleic acid encoding it, for
example resistance to proteolytic degradation; or combinations of these, as well as linker sequences
as needed.

In a preferred embodiment, the fusion partner is a stability sequence to confer stability to the library
member or the nucleic acid encoding it. Thus, for example, peptides may be stabilized by the
incorporation of glycines after the initiation methionine (MG or MGG0), for protection of the peptide from
ubiquitination as per Varshavsky's N-End Rule, thus conferring long half-life in the cytoplasm.
Similarly, two prolines at the C-terminus impart peptides that are largely resistant to carboxypeptidase
action. The presence of two glycines prior to the prolines both imparts flexibility and prevents structure
initiating events in the di-proline from being propagated into the candidate peptide structure. Thus,
preferred stability sequences are as follows: MG(X)nGGPP, where X is any amino acid and n is an
integer of at least four.
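
By way of illustration only, a candidate peptide can be checked against the MG(X)nGGPP pattern with a
short regular expression. The sketch below assumes that X is restricted to the twenty standard amino
acids in one-letter code and that the whole peptide must fit the pattern; the function name
has_stability_sequence is hypothetical.

```python
import re

# MG, then at least four residues drawn from the 20 standard amino acids
# (an assumption; the text says only "any amino acid"), then GGPP.
STABILITY_MOTIF = re.compile(r"MG[ACDEFGHIKLMNPQRSTVWY]{4,}GGPP")


def has_stability_sequence(peptide: str) -> bool:
    """Return True if the entire peptide has the form MG(X)nGGPP with n >= 4."""
    return STABILITY_MOTIF.fullmatch(peptide.upper()) is not None


if __name__ == "__main__":
    for candidate in ("MGACDEGGPP", "MGACDGGPP", "ACDEGGPP"):
        print(candidate, has_stability_sequence(candidate))
```

Here MGACDEGGPP is accepted (n = 4), while MGACDGGPP (n = 3) and ACDEGGPP (no leading MG) are
rejected.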

Labeling (isotopic, fluorescent, affinity)

In one embodiment, the library nucleic acids, proteins and antibodies of the invention are labeled. By
"labeled" herein is meant that nucleic acids, proteins and antibodies of the invention have at least one 
element, isotope or chemical compound attached to enable the detection of nucleic acids, proteins 
and antibodies of the invention. In general, labels fall into three classes: a) isotopic labels, which may 
be radioactive or heavy isotopes; b) affinity labels, which may be antibodies or antigens; and c) 
colored or fluorescent dyes. The labels may be incorporated into the compound at any position. 

Expression systems 

The library proteins of the present invention are produced by culturing a host cell transformed with
nucleic acid, preferably an expression vector, containing nucleic acid encoding a library protein,
under the appropriate conditions to induce or cause expression of the library protein. As outlined
below, the libraries may be the basis of a variety of display techniques, including, but not limited to,
phage and other viral display technologies, yeast, bacterial, and mammalian display technologies.
The conditions appropriate for library protein expression will vary with the choice of the expression
vector and the host cell, and will be easily ascertained by one skilled in the art through routine
experimentation. For example, the use of constitutive promoters in the expression vector will require
optimizing the growth and proliferation of the host cell, while the use of an inducible promoter requires
the appropriate growth conditions for induction. In addition, in some embodiments, the timing of the
harvest is important. For example, the baculoviral systems used in insect cell expression are lytic
viruses, and thus harvest time selection may be crucial for product yield.

As will be appreciated by those in the art, the type of cells used in the present invention may vary
widely. Basically, a wide variety of appropriate host cells may be used, including yeast, bacteria,
archaebacteria, fungi, and insect and animal cells, including mammalian cells. Of particular interest
are Drosophila melanogaster cells, Saccharomyces cerevisiae and other yeasts, E. coli, Bacillus
subtilis, SF9 cells, C129 cells, 293 cells, Neurospora, BHK, CHO, COS, and HeLa cells, fibroblasts,
Schwanoma cell lines, immortalized mammalian myeloid and lymphoid cell lines, Jurkat cells, mast
cells and other endocrine and exocrine cells, and neuronal cells. See the ATCC cell line catalog,
hereby expressly incorporated by reference. In addition, the expression of the secondary libraries in
phage display systems, such as are well known in the art, is particularly preferred, especially when
the secondary library comprises random peptides. In one embodiment, the cells may be genetically
engineered, that is, contain exogenous nucleic acid, for example, to contain target molecules.

Mammalian expression systems 

In a preferred embodiment, the library proteins are expressed in mammalian cells. Any mammalian
cells may be used, with mouse, rat, primate and human cells being particularly preferred, although as
will be appreciated by those in the art, modifications of the system by pseudotyping allow all
eukaryotic cells to be used, preferably higher eukaryotes. As is more fully described below, a screen
will be set up such that the cells exhibit a selectable phenotype in the presence of a random library
member. As is more fully described below, cell types implicated in a wide variety of disease
conditions are particularly useful, so long as a suitable screen may be designed to allow the selection
of cells that exhibit an altered phenotype as a consequence of the presence of a library member
within the cell.

Accordingly, suitable mammalian cell types include, but are not limited to, tumor cells of all types
(particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney,
prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T-cell
and B cell), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including
mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and
myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts,
chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells,
and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat
T cells, NIH3T3 cells, CHO, COS, etc. See the ATCC cell line catalog, hereby expressly incorporated
by reference.

Mammalian expression systems are also known in the art, and include retroviral systems. A
mammalian promoter is any DNA sequence capable of binding mammalian RNA polymerase and
initiating the downstream (3') transcription of a coding sequence for library protein into mRNA. A
promoter will have a transcription-initiating region, which is usually placed proximal to the 5' end of the
coding sequence, and a TATA box, usually located 25-30 base pairs upstream of the transcription
initiation site. The TATA box is thought to direct RNA polymerase II to begin RNA synthesis at the
correct site. A mammalian promoter will also contain an upstream promoter element (enhancer
element), typically located within 100 to 200 base pairs upstream of the TATA box. An upstream
promoter element determines the rate at which transcription is initiated and may act in either
orientation. Of particular use as mammalian promoters are the promoters from mammalian viral
genes, since the viral genes are often highly expressed and have a broad host range. Examples
include the SV40 early promoter, mouse mammary tumor virus LTR promoter, adenovirus major late
promoter, herpes simplex virus promoter, and the CMV promoter.
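
The promoter geometry described above can be sketched as a simple positional check on a candidate
construct. The following is a minimal illustration, assuming the canonical motif TATAAA as a stand-in
for the TATA box and a known transcription initiation site index; the function find_tata_box and its
parameters are hypothetical and real TATA elements vary in sequence.

```python
def find_tata_box(sequence, tss_index, motif="TATAAA", window=(25, 30)):
    """Return the 0-based start of `motif` if it begins 25-30 bp upstream
    of the transcription start site at `tss_index`, otherwise None.

    The default motif and window are assumptions made for illustration.
    """
    seq = sequence.upper()
    nearest, farthest = window
    for distance in range(nearest, farthest + 1):
        start = tss_index - distance
        if start >= 0 and seq[start:start + len(motif)] == motif:
            return start
    return None


if __name__ == "__main__":
    # Toy construct: the motif starts at index 20, the start site is at index 48.
    construct = "G" * 20 + "TATAAA" + "C" * 22 + "A" + "G" * 10
    print(find_tata_box(construct, 48))
```

A check of this kind only confirms spacing; it says nothing about promoter strength or host range.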



Typically, transcription termination and polyadenylation sequences recognized by mammalian cells 
are regulatory regions located 3' to the translation stop codon and thus, together with the promoter 
elements, flank the coding sequence. The 3' terminus of the mature mRNA is formed by site-specific
post-transcriptional cleavage and polyadenylation. Examples of transcription terminator and
polyadenylation signals include those derived from SV40.



The methods of introducing exogenous nucleic acid into mammalian hosts, as well as other hosts, are
well known in the art, and will vary with the host cell used. Techniques include dextran-mediated
transfection, calcium phosphate precipitation, polybrene mediated transfection, protoplast fusion,
electroporation, viral infection, encapsulation of the polynucleotide(s) in liposomes, and direct
microinjection of the DNA into nuclei.



Bacterial expression systems 

In a preferred embodiment, library proteins are expressed in bacterial systems. Bacterial expression
systems are well known in the art. 



A suitable bacterial promoter is any nucleic acid sequence capable of binding bacterial RNA
polymerase and initiating the downstream (3') transcription of the coding sequence of library protein
into mRNA. A bacterial promoter has a transcription initiation region which is usually placed proximal
to the 5' end of the coding sequence. This transcription initiation region typically includes an RNA
polymerase binding site and a transcription initiation site. Sequences encoding metabolic pathway
enzymes provide particularly useful promoter sequences. Examples include promoter sequences
derived from sugar metabolizing enzymes, such as galactose, lactose and maltose, and sequences
derived from biosynthetic enzymes such as tryptophan. Promoters from bacteriophage may also be
used and are known in the art. In addition, synthetic promoters and hybrid promoters are also useful;
for example, the tac promoter is a hybrid of the trp and lac promoter sequences. Furthermore, a
bacterial promoter may include naturally occurring promoters of non-bacterial origin that have the
ability to bind bacterial RNA polymerase and initiate transcription.

In addition to a functioning promoter sequence, an efficient ribosome-binding site is desirable. In E.
coli, the ribosome-binding site is called the Shine-Dalgarno (SD) sequence and includes an initiation
codon and a sequence 3-9 nucleotides in length located 3-11 nucleotides upstream of the initiation
codon.



Baculovirus expression system 



In one embodiment, library proteins are produced in insect cells. Expression vectors for the
transformation of insect cells, and in particular, baculovirus-based expression vectors, are well known
in the art and are described, e.g., in O'Reilly et al., Baculovirus Expression Vectors: A Laboratory
Manual (New York: Oxford University Press, 1994).



Yeast expression systems 

In a preferred embodiment, library protein is produced in yeast cells. Yeast expression systems are
well known in the art, and include expression vectors for Saccharomyces cerevisiae, Candida
albicans and C. maltosa, Hansenula polymorpha, Kluyveromyces fragilis and K. lactis, Pichia
guillerimondii and P. pastoris, Schizosaccharomyces pombe, and Yarrowia lipolytica. Preferred
promoter sequences for expression in yeast include the inducible GAL1,10 promoter, the promoters
from alcohol dehydrogenase, enolase, glucokinase, glucose-6-phosphate isomerase,
glyceraldehyde-3-phosphate dehydrogenase, hexokinase, phosphofructokinase, 3-phosphoglycerate mutase,
pyruvate kinase, and the acid phosphatase gene. Yeast selectable markers include ADE2, HIS4,
LEU2, TRP1, and ALG7, which confers resistance to tunicamycin; the neomycin phosphotransferase
gene, which confers resistance to G418; and the CUP1 gene, which allows yeast to grow in the
presence of copper ions.



In vitro expression systems

In one embodiment, the library proteins are expressed in vitro using cell-free translation systems.
Several commercial sources are available for this system including, but not limited to, Roche Rapid
Translation System, Promega TnT system, Novagen's EcoPro system, and Ambion's ProteinScript-PRO
system. In vitro translation systems derived from both prokaryotic (e.g. E. coli) and eukaryotic (e.g.
wheat germ, rabbit reticulocytes) cells are available and may be chosen based on the expression
levels and functional properties of the protein of interest. Both linear (as derived from a PCR
amplification) and circular (as in plasmid) DNA molecules are suitable for such expression as long as
they contain the gene encoding the protein operably linked to an appropriate promoter. Other
features of the molecule that are important for optimal expression in either the bacterial or eukaryotic
cells (including the ribosome binding site etc.) are also included in these constructs. The proteins may
again be expressed individually or in suitable size pools consisting of multiple library members. The
main advantage offered by these in vitro systems is their speed and ability to produce soluble
proteins. In addition, the protein being synthesized may be selectively labeled if needed for
subsequent functional analysis.

Protein purification 

In a preferred embodiment, the library protein is purified or isolated after expression. Library proteins
may be isolated or purified in a variety of ways known to those skilled in the art depending on what
other components are present in the sample. Standard purification methods include electrophoretic,
molecular, immunological and chromatographic techniques, including ion exchange, hydrophobic,
affinity, and reverse-phase HPLC chromatography, and chromatofocusing. For example, the library
protein may be purified using a standard anti-library antibody column. Ultrafiltration and diafiltration
techniques, in conjunction with protein concentration, are also useful. For general guidance in
suitable purification techniques, see Scopes, R., Protein Purification, Springer-Verlag, NY (1982).
The degree of purification necessary will vary depending on the use of the library protein. In some
instances no purification will be necessary.


Screening of library members

Library members may be screened using a variety of assays, including but not limited to in vitro 
assays, and in vivo assays such as cell-based, tissue-based, and whole-organism assays. 
Automation and high-throughput screening technologies may be utilized in the screening procedures. 

Cell-based assays - eukaryotic and prokaryotic

In a preferred embodiment, the library is screened using cell-based assay systems.

In vivo selection of library variants

Host cells transformed with a library representing variants of an enzyme or resistance factor of
interest are grown in the presence of the corresponding substrate or antibiotic. Only clones with a
functional variant of the enzyme or resistance factor will survive.



Screening based on cell survival, cell death or expression of reporter genes in cells

Cells are exposed to individual variants or pools of variants belonging to a library to be assayed. The
cells are transformed or transfected either transiently or stably with the corresponding receptor
responsive to the ligand represented by the library. The receptor is coupled to a signaling pathway
that either causes cell death, cell survival, or triggers expression of a reporter gene. These readout
modalities may be measured using dyes or immuno-cytochemical reagents that indicate cell death or
cell vitality (e.g. Caspase staining assay for apoptosis, Alamar blue for cell vitality), or, in the case of
reporter constructs, enzymes that convert dyes and cause them to be luminescent (e.g. luciferase) or
shift their absorbance or fluorescent properties to wavelengths different from their properties before
conversion.



Screening based on cell survival of individual clones or clone pools

Host cells are transformed or transfected with library DNA representing variants of a ligand or
receptor of interest. The cells are also transformed or transfected either transiently or stably with the
corresponding receptor responsive to the ligand represented by the library or, in the case of a receptor
library, with the ligand signaling through the receptor represented by the library. The receptor is coupled
to a signaling pathway that causes cell survival. If the sequence of the variant causing cell survival is
not pre-identified, surviving cell clones may be used to identify the sequence identity of the
corresponding variant.



Screening based on morphological changes of cells

All of the above described assay readouts rely on changes that may be measured using absorbance,
fluorescence or luminescence readers. The assays described may also be read by measuring
morphological changes of the cells as a response to the presence of a library variant. These
morphological changes may be registered using microscopic image analysis systems (e.g. Cellomics
ArrayScan technology) now available commercially.



Screening based on candidate bioactive agents

Candidate agents are obtained from a wide variety of sources, as will be appreciated by those in the
art, including libraries of synthetic or natural compounds. As will be appreciated by those in the art,
the present invention provides a rapid and easy method for screening any library of candidate agents,
including the wide variety of known combinatorial chemistry-type libraries.

In a preferred embodiment, candidate agents are synthetic compounds. Any number of techniques
are available for the random and directed synthesis of a wide variety of organic compounds and
biomolecules, including expression of randomized oligonucleotides. See for example WO 94/24314,
hereby expressly incorporated by reference, which discusses methods for generating new
compounds, including random chemistry methods as well as enzymatic methods. As described in
WO 94/24314, one of the advantages of the present method is that it is not necessary to characterize
the candidate bioactive agents prior to the assay; only candidate agents that bind to the target need
be identified. In addition, as is known in the art, coding tags using split synthesis reactions may be
done, to essentially identify the chemical moieties on the beads.

Alternatively, a preferred embodiment utilizes libraries of natural compounds in the form of bacterial, 
fungal, plant and animal extracts that are available or readily produced, and can be attached to beads 
as is generally known in the art. 

Additionally, natural or synthetically produced libraries and compounds are readily modified through 
conventional chemical, physical and biochemical means. Known pharmacological agents may be 
subjected to directed or random chemical modifications, including enzymatic modifications, to produce 
structural analogs. 

In a preferred embodiment, candidate bioactive agents include proteins, nucleic acids, and chemical
moieties.

In a preferred embodiment, the candidate bioactive agents are proteins. In a preferred embodiment,
the candidate bioactive agents are naturally occurring proteins or fragments of naturally occurring
proteins. Thus, for example, cellular extracts containing proteins, or random or directed digests of
proteinaceous cellular extracts, may be attached to beads as is more fully described below. In this
way libraries of procaryotic and eucaryotic proteins may be made for screening against any number of
targets. Particularly preferred in this embodiment are libraries of bacterial, fungal, viral, and
mammalian proteins, with the latter being preferred, and human proteins being especially preferred.

In a preferred embodiment, the candidate bioactive agents are peptides of from about 2 to about 50
amino acids, with from about 5 to about 30 amino acids being preferred, and from about 8 to about 20
being particularly preferred. The peptides may be digests of naturally occurring proteins as is outlined
above, random peptides, or "biased" random peptides. By "randomized" or grammatical equivalents
herein is meant that each nucleic acid and peptide consists of essentially random nucleotides and
amino acids, respectively. Since generally these random peptides (or nucleic acids, discussed below)
are chemically synthesized, they may incorporate any nucleotide or amino acid at any position. The
synthetic process can be designed to generate randomized proteins or nucleic acids, to allow the
formation of all or most of the possible combinations over the length of the sequence, thus forming a
library of randomized candidate bioactive proteinaceous agents. In addition, the candidate agents
may themselves be the product of the invention; that is, a library of proteinaceous candidate agents
may be made using the methods of the invention.

High-throughput screening technology

Fully robotic or microfluidic systems include automated liquid-, particle-, cell- and organism-handling,
including high throughput pipetting to perform all steps of gene targeting and recombination
applications. This includes liquid, particle, cell, and organism manipulations such as aspiration,
dispensing, mixing, diluting, washing, accurate volumetric transfers; retrieving and discarding of
pipette tips; and repetitive pipetting of identical volumes for multiple deliveries from a single sample
aspiration. These manipulations are cross-contamination-free liquid, particle, cell, and organism
transfers. This instrument performs automated replication of microplate samples to filters,
membranes, and/or daughter plates, high-density transfers, full-plate serial dilutions, and high
capacity operation.



In addition, as will also be appreciated by those in the art, biochips may be part of the HTS system
utilizing any number of components such as biosensor chips with protein arrays to measure
protein-protein interactions or DNA-sensor chips to measure protein-DNA interactions. Microfluidic chip
arrays (e.g., technology developed by Caliper) may also be utilized in the context of automated HTS
screening.



The automated HTS system used may include a computer workstation comprising a microprocessor
programmed to manipulate a device selected from the group consisting of a thermocycler, a
multichannel pipetter, a sample handler, a plate handler, a gel loading system, an automated
transformation system, a gene sequencer, a colony picker, a bead picker, a cell sorter, an incubator,
a light microscope, a fluorescence microscope, a spectrofluorimeter, a spectrophotometer, a
luminometer, a CCD camera and combinations thereof.






In vitro assays



In a preferred embodiment, different physical and functional properties of the library members are
screened in an in vitro assay. In vitro assays allow a broader dynamic range for screening protein
properties of interest that are not limited by cellular viability of the cells expressing the library
members or library members acting upon other cells to exert their effects. Properties of library members
that may be screened include, but are not limited to, various aspects of stability (including pH,
thermal, oxidative/reductive and solvent stability), solubility, affinity, activity and specificity. Multiple
properties may be screened simultaneously (e.g. substrate specificity in organic solvents,
receptor-ligand binding at low pH) or individually.

Protein properties may be assayed and detected in a wide variety of ways. Detection modalities
include, but are not limited to, chromogenic, fluorescent, luminescent, or isotopic substrates for
protein library members. Any of these detection modalities may be utilized in several assay methods
including, but not limited to, FRET (fluorescence resonance energy transfer) and BRET
(bioluminescence resonance energy transfer) based assays, AlphaScreen (Amplified Luminescent
Proximity Homogeneous Assay), SPA (scintillation proximity assay), ELISA (enzyme-linked
immunosorbent assays), or enzymatic assays.

Additional characterization 

In a preferred embodiment, a library member or members isolated from a cell positively selected for
any number of protein properties by in vivo or in vitro screening methods well known to those in the
art are further characterized for said properties by aforementioned screens or other methods
including physical, structural, kinetic, and thermodynamic analysis. Thus, for example, a selected
library variant may be subjected to physical characterization through gel electrophoresis,
reverse-phase HPLC, MS, LC-MS, RP-HPLC, SEC-HPLC, LC-MS peptide mapping, CD, analytical
ultracentrifugation, and proteolysis. Structural analysis employing X-ray crystallographic techniques,
NMR, and cross-linking are also useful. In addition, thermodynamic and kinetic characterization of
proteinaceous moieties are well known in the art.

EXAMPLES 

The following examples serve to more fully describe the manner of using the above-described 
invention, as well as to set forth the best modes contemplated for carrying out various aspects of the 
invention. It is understood that these examples in no way serve to limit the true scope of this 
invention, but rather are presented for illustrative purposes. All references cited herein are 
incorporated by reference. 

Example 1 

Computational Prescreening on β-lactamase TEM-1

Experiments were performed on the β-lactamase gene TEM-1. Brookhaven Protein Data Bank entry
1BTL was used as the starting structure. All water molecules and the SO4 group were removed and
explicit hydrogens were generated on the structure. The structure was then minimized for 50 steps
without electrostatics using the conjugate gradient method and the Dreiding II force field. These steps
were performed using the BIOGRAF program commercially available from Molecular Simulations,
Inc., San Diego, CA. This minimized structure served as the template for all the protein design
calculations.

Computational Screening

Computational screening of sequences was performed using PDA™ technology. A 4 Å sphere was
drawn around the heavy side chain atoms of the four catalytic residues (S70, K73, S130, and E166)
and all amino acids having heavy side chain atoms within this distance cutoff were selected. This
yielded the following 7 positions: F72, Y105, N132, N136, L169, N170, and K234. Two of these
residues, N132 and K234, are highly conserved across several different β-lactamases and were
therefore not included in the design, leaving five variable residue positions (F72, Y105, N136, L169,
N170). These designed positions were allowed to change their identity to any of the 20 naturally
occurring amino acids except proline, cysteine, and glycine (a total of 17 amino acids). Proline is
usually not allowed since it is difficult to define appropriate rotamers for proline, cysteine is excluded
to prevent formation of disulfide bonds, and glycine is excluded because of conformational flexibility.

Additionally, a second set of residues within 5 Å of the five residues selected for PDA™ technology
design were floated (their amino acid identity was retained as wild type, but their conformation was
allowed to change). The heavy side chain atoms were again used to determine which residues were
within the cutoff. This yielded the following 28 positions: M68, M69, S70, T71, K73, V74, L76, V103,
E104, S106, P107, I127, M129, S130, A135, L139, L148, L162, R164, W165, E166, P167, D179,
M211, D214, V216, S235, I247. The two prolines, P107 and P167, were excluded from the floated
residues, as were positions M69, R164, and W165, since their crystal structures exhibit highly
strained rotamers, leaving 23 floated residues from the second set. Also, A248 was included instead
of I247. The conserved residues N132 and K234 from the first sphere (4 Å) were also floated,
resulting in a total of 25 floated residues.
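
The two-shell selection described above (a 4 Å shell around the catalytic residues for the designed positions, then a 5 Å shell around the designed positions for the floated ones) can be illustrated with a minimal sketch. The function name, the data layout and the toy coordinates below are assumptions made for illustration only; they are not taken from the 1BTL structure or from the PDA™ technology software.

```python
import math

def residues_within_cutoff(side_chain_atoms, seed_residues, cutoff):
    """Return residues having any heavy side-chain atom within `cutoff` (in Å)
    of any heavy side-chain atom of a seed residue. Seeds themselves are skipped."""
    seed_atoms = [xyz for res in seed_residues for xyz in side_chain_atoms[res]]
    selected = []
    for res, atoms in side_chain_atoms.items():
        if res in seed_residues:
            continue
        if any(math.dist(a, s) <= cutoff for a in atoms for s in seed_atoms):
            selected.append(res)
    return sorted(selected)

# Hypothetical toy coordinates (Å); real input would come from a structure file.
toy_atoms = {
    "S70": [(0.0, 0.0, 0.0)],
    "F72": [(3.0, 0.0, 0.0)],
    "V74": [(6.5, 0.0, 0.0)],
    "A200": [(20.0, 0.0, 0.0)],
}

designed = residues_within_cutoff(toy_atoms, {"S70"}, 4.0)        # first shell
floated = residues_within_cutoff(toy_atoms, set(designed), 5.0)   # second shell
```

In this toy case the catalytic seed residue itself falls inside the second shell, which mirrors the way S70, K73, S130 and E166 appear among the floated positions listed above.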

The potential functions and parameters used in the PDA™ technology calculations were as follows.
The van der Waals scale factor was set to 0.9, and the electrostatic potential was calculated using a
distance dependent dielectric of ε = 40R. The well depth for the hydrogen bond potential was set to 8
kcal/mol with a local and remote backbone scale factor of 0.25 and 1.0 respectively. The solvation
potential was only calculated for designed positions classified as core (F72, L169, M68, T71, V74,
L76, I127, A135, L139, L148, L162, M211 and A248). Type 2 solvation was used (Street and Mayo,
1998). The non-polar exposure multiplication factor was set to 1.6, the non-polar burial energy was
set to 0.048 kcal/mol/Å² and the polar hydrogen burial energy was set to 2.0 kcal/mol.

The Dead End Elimination (DEE) optimization method (see reference) was used to find the lowest
energy, ground state sequence. DEE cutoffs of 50 and 100 kcal/mol were used for singles and
doubles energy calculations, respectively.

Starting from the DEE ground state sequence, a Monte Carlo (MC) calculation was performed that
generated a list of the 1000 lowest energy sequences. The MC parameters were 100 annealing
cycles with 1,000,000 steps per cycle. The non-productive cycle limit was set to 50. In the annealing
schedule, the high and low temperatures were set to 5000 and 100 K respectively.
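
As an illustration of this kind of annealed Monte Carlo search, the sketch below runs Metropolis sampling over sequences and keeps the lowest energy sequences encountered. It is a toy under stated assumptions, not the PDA™ technology implementation: the energy function `toy_energy`, the linear cooling schedule, the use of the Boltzmann constant in kcal/(mol K) and the reduced cycle and step counts are all hypothetical, and the non-productive cycle limit is not modeled.

```python
import math
import random

AMINO_ACIDS = "ADEFHIKLMNQRSTVWY"  # 17 residues: proline, cysteine and glycine excluded

def toy_energy(seq):
    # Hypothetical stand-in for a real scoring function:
    # number of mismatches against an arbitrary target sequence.
    target = "YMDLM"
    return float(sum(a != b for a, b in zip(seq, target)))

def anneal(start, energy, cycles=5, steps=2000, t_high=5000.0, t_low=100.0,
           keep=10, seed=0):
    """Metropolis Monte Carlo with repeated cooling cycles.
    Returns up to `keep` (sequence, energy) pairs, lowest energy first."""
    rng = random.Random(seed)
    k_b = 0.0019872  # Boltzmann constant in kcal/(mol K), assumed energy units
    current, e_cur = start, energy(start)
    visited = {current: e_cur}
    for _ in range(cycles):
        for step in range(steps):
            # Assumed linear cooling from t_high to t_low within each cycle.
            temp = t_high + (t_low - t_high) * step / (steps - 1)
            pos = rng.randrange(len(current))
            trial = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
            e_trial = energy(trial)
            accept = e_trial <= e_cur or (
                rng.random() < math.exp(-(e_trial - e_cur) / (k_b * temp)))
            if accept:
                current, e_cur = trial, e_trial
                visited[current] = e_cur
    return sorted(visited.items(), key=lambda kv: kv[1])[:keep]

lowest = anneal("FYNLN", toy_energy)
```

The returned list plays the role of the ranked MC list; in the calculation described above the list held 1000 sequences and the search started from the DEE ground state rather than from an arbitrary sequence.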

The following probability distribution was then calculated from the top 1000 sequences in the MC list
(see Table 1 below). It shows the number of occurrences of each of the amino acids selected for
each position (the 5 variable residue positions and the 25 floated positions).

Table 1: Monte Carlo analysis (amino acids and their number of occurrences at the designed
positions resulting from the MC list of the 1000 lowest energy ranked sequences).

POSITION   AMINO ACID: OCCURRENCES
69         M:1000
70         S:1000
71         T:1000
72         Y:591 F:365 V:35 E:8 L:1
73         K:1000
74         V:1000
76         L:1000
103        V:1000
104        E:1000
105        M:183 Q:142 I:132 N:129 E:126 S:115 D:97 A:76
106        S:1000
127        I:1000
129        M:1000
130        S:1000
132        N:1000
135        A:1000
136        D:530 M:135 N:97 V:68 E:66 S:38 T:33 A:27 Q:6
139        L:1000
148        L:1000
162        L:1000
166        E:1000
169        L:698 E:156 M:64 S:37 D:23 A:21 Q:10
170        M:249 L:118 E:113 D:112 T:90 Q:87 S:66 R:44 A:36 N:24 F:21 K:15 Y:9 H:9 V:8
179        D:1000
211        M:1000
214        D:1000
216        V:1000
234        K:1000
235        S:1000
248        A:1000
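
A per-position occurrence table of this kind can be tallied from a ranked sequence list with a few lines of code. The sketch below is illustrative only: the function name and the four toy sequences are assumptions, and each sequence is written as one letter per designed position in the order 72, 105, 136, 169, 170.

```python
from collections import Counter

def position_counts(sequences, positions):
    """Count how often each amino acid occurs at each position."""
    counts = {pos: Counter() for pos in positions}
    for seq in sequences:
        for pos, aa in zip(positions, seq):
            counts[pos][aa] += 1
    return counts

# Hypothetical miniature "MC list"; the real list held 1000 sequences.
toy_list = ["YMDLM", "YQDLL", "FMDLM", "YIMLE"]
counts = position_counts(toy_list, [72, 105, 136, 169, 170])

for pos, counter in counts.items():
    row = " ".join(f"{aa}:{n}" for aa, n in counter.most_common())
    print(pos, row)
```

Dividing each count by the length of the list gives the probability of each amino acid at that position.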






This probability distribution was then transformed into a rounded probability distribution (see Table 2).
A 10% cutoff value was used to round at the designed positions and the wild type amino acids were
forced to occur with a probability of at least 10%. An E was found at position 169 15.6% of the time.
However, since this position is adjacent to another designed position, 170, its closeness would have
required a more complicated oligonucleotide library design; E was therefore not included for this
position when generating the sequence library (only L was used).



Table 2: Rounded probability distribution at the designed positions.

POSITION   RESIDUE: PROBABILITY
72         Y 50%   F 50%
105        M 20%   Q 20%   I 20%   N 10%   E 10%   S 10%   Y 10%
136        D 70%   M 20%   N 10%
169        L 100%
170        M 30%   L 20%   E 20%   D 20%   N 10%


As seen from Table 2, the computational pre-screening resulted in an enormous reduction in the size
of the problem. Originally, 17 different amino acids were allowed at each of the 5 designed positions,
giving 17^5 = 1,419,857 possible sequences. This was pared down to just 210 possible sequences,
a reduction of nearly four orders of magnitude.
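
The arithmetic behind these figures can be checked directly. In the sketch below the number of residues retained per position is read off the rounded distribution (2, 7, 3, 1 and 5 residues at positions 72, 105, 136, 169 and 170); the variable names are illustrative.

```python
import math

# Residues retained at each designed position after rounding.
allowed_per_position = {72: 2, 105: 7, 136: 3, 169: 1, 170: 5}

unscreened = 17 ** len(allowed_per_position)          # 17 amino acids at 5 positions
library = math.prod(allowed_per_position.values())    # size of the prescreened library
reduction = math.log10(unscreened / library)          # orders of magnitude removed

print(unscreened, library, round(reduction, 2))
```

The ratio is roughly 6,800-fold, that is, a little under four orders of magnitude.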



Generation of Sequence Library 

Overlapping oligonucleotides corresponding to the full length TEM-1 gene for β-lactamase and all
desired mutations were synthesized and used in a PCR reaction as described previously (Figure 1),
resulting in a sequence library containing the 210 sequences described above.

Synthesis of mutant TEM-1 genes

To allow the mutation of the TEM-1 gene, pCR2.1 (commercially available from Invitrogen) was
digested with XbaI and EcoRI, blunt ended with T4 DNA polymerase, and religated. This removes the
HindIII and XhoI sites within the polylinker. A new XhoI site was then introduced into the TEM-1 gene
at position 2269 (numbering as of the original pCR2.1) using a Quickchange Site-Directed
Mutagenesis Kit as described by the manufacturer (commercially available from Stratagene).
Similarly, a new HindIII site was introduced at position 2674 to give pCR-Xen1.

To construct the mutated TEM-1 genes, overlapping 40-mer oligonucleotides were synthesized
corresponding to the sequence between the newly introduced XhoI and HindIII sites, designed to
allow a 20-nucleotide overlap with adjacent oligonucleotides. At each of the designed positions (72,
105, 136 and 170) multiple oligonucleotides were synthesized, each containing a different mutation so
that all the possible combinations of mutant sequences (210) could be made in the desired
proportions as shown in Table 3. For example, at position 72, two sets of oligonucleotides were
synthesized, one containing an F at position 72, the other containing a Y. Each oligonucleotide was
resuspended at a concentration of 1 µg/µl, and equal molar concentrations of the oligonucleotides
were pooled.

At the redundant positions, each oligonucleotide was added at a concentration that reflected the
probabilities in Table 3. For example, at position 72 equal amounts of the two oligonucleotides were
added to the pool, while at position 136, twice as much M-encoding oligonucleotide was added
compared to the N-containing oligonucleotide, and seven times as much D-containing oligonucleotide
was added compared to the N-containing oligonucleotide.
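
The pooling ratios follow directly from the predicted percentages. A minimal sketch (the function name is an assumption) reduces the percentages at one position to the smallest whole-number ratio of oligonucleotide amounts:

```python
from functools import reduce
from math import gcd

def relative_amounts(percentages):
    """Reduce predicted percentages at one position to the smallest integer ratio."""
    divisor = reduce(gcd, percentages.values())
    return {aa: pct // divisor for aa, pct in percentages.items()}

ratio_136 = relative_amounts({"D": 70, "M": 20, "N": 10})
ratio_72 = relative_amounts({"Y": 50, "F": 50})
print(ratio_136, ratio_72)
```

For position 136 this gives seven parts D-encoding and two parts M-encoding oligonucleotide for every one part N-encoding oligonucleotide, matching the proportions stated above.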

DNA library assembly 

For the first round of PCR, 2 µl of pooled oligonucleotides at the desired probabilities (Table 3) were
added to a 100 µl reaction that contained 2 µl 10 mM dNTPs, 10 µl 10x Taq buffer (commercially
available from Qiagen), 1 µl of Taq DNA polymerase (5 units/µl; commercially available from Qiagen)
and 2 µl Pfu DNA polymerase (2.5 units/µl; commercially available from Promega). The reaction
mixture was assembled on ice and subjected to 94°C for 5 minutes, 15 cycles of 94°C for 30 seconds,
52°C for 30 seconds and 72°C for 30 seconds, and a final extension step of 72°C for 10 minutes.

Isolation of full-length oligonucleotides 

For the second round of PCR, 2.5 µl of the first round reaction was added to a 100 µl reaction
containing 2 µl 10 mM dNTPs, 10 µl of 10x Pfu DNA polymerase buffer commercially available from
Promega, 2 µl Pfu DNA polymerase (2.5 units/µl; commercially available from Promega), and 1 µg of
oligonucleotides corresponding to the 5' and 3' ends of the synthesized gene. The reaction mixture
was assembled on ice and subjected to 94°C for 5 minutes, 20 cycles of 94°C for 30 seconds, 52°C
for 30 seconds and 72°C for 30 seconds, and a final extension step of 72°C for 10 minutes to isolate
the full length oligonucleotides.

Purification of DNA library 

The PCR products were purified using a QIAquick PCR Purification Kit commercially available from
Qiagen, digested with XhoI and HindIII, electrophoresed through a 12 % agarose gel and re-purified
using a QIAquick Gel Extraction Kit commercially available from Qiagen.

Verification of Sequence Library Identity 




The PCR products containing the library of mutant TEM-1 β-lactamase genes were then cloned
between a promoter and terminator in a kanamycin resistant plasmid and transformed into E. coli. An
equal number of bacteria were then spread onto media containing either kanamycin or ampicillin. All
transformed colonies will be resistant to kanamycin, but only those with active mutated β-lactamase
genes will grow on ampicillin. After overnight incubation, several colonies were observed on both
plates, indicating that at least one of the above sequences encodes an active β-lactamase. The
number of colonies on the kanamycin plate far outnumbered those on the ampicillin plate (roughly a
5:1 ratio), suggesting that either some of the sequences destroy activity, or that the PCR introduces
errors that yield an inactive or truncated enzyme.

To distinguish between these possibilities, 60 colonies were picked from the kanamycin plate and
their plasmid DNA was sequenced. This gave the distribution shown in Table 3.

Table 3: Percentages predicted by PDA™ technology vs. those observed from experiment for the
designed positions.



Wild Type   PDA™ Technology Residues (Predicted Percentage/Observed Percentage)
72F         Y 50/50   F 50/50
105Y        M 20/27   Q 20/18   I 20/21   N 10/7   E 10/7   S 10/10   Y 10/10
136N        D 70/72   M 20/17   N 10/11
170N        M 30/34   L 20/21   E 20/21   D 20/17   N 10/7







Note that the observed percentages of each amino acid at all four positions closely match the
predicted percentages. Sequencing also revealed that only one of the 60 colonies contained a PCR
error, a G to C transition.

This small test demonstrates that multiple PCR with pooled oligonucleotides may be used to construct
a sequence library that reflects the desired proportions of amino acid changes.
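
How closely the observed percentages track the predicted ones can be quantified with a short calculation over the Table 3 values. The sketch below is illustrative; the data layout and variable names are assumptions, while the numbers are the predicted/observed pairs from Table 3.

```python
# (predicted %, observed %) for each residue at each designed position, from Table 3.
table3 = {
    72:  {"Y": (50, 50), "F": (50, 50)},
    105: {"M": (20, 27), "Q": (20, 18), "I": (20, 21), "N": (10, 7),
          "E": (10, 7), "S": (10, 10), "Y": (10, 10)},
    136: {"D": (70, 72), "M": (20, 17), "N": (10, 11)},
    170: {"M": (30, 34), "L": (20, 21), "E": (20, 21), "D": (20, 17), "N": (10, 7)},
}

deviations = {
    (pos, aa): abs(observed - predicted)
    for pos, residues in table3.items()
    for aa, (predicted, observed) in residues.items()
}
worst = max(deviations, key=deviations.get)
max_dev = deviations[worst]
mean_dev = sum(deviations.values()) / len(deviations)
print(worst, max_dev, mean_dev)
```

With 60 sequenced colonies a single colony corresponds to about 1.7 percentage points, so deviations of this size amount to a difference of only a few colonies per residue.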

Experimental Screening of Sequence Library

The purified PCR product containing the library of mutated sequences was then ligated into
pCR-Xen1 that had previously been digested with XhoI and HindIII and purified. The ligation reaction was
transformed into competent TOP10 E. coli cells (Invitrogen). After allowing the cells to recover for 1
hour at 37°C, the cells were spread onto LB plates containing the antibiotic cefotaxime at
concentrations ranging from 0.1 µg/ml to 50 µg/ml and selected for increasing resistance.

A triple mutant was found that improved enzyme function by 35-fold in only a single round of
screening (see Figure 4). This mutant (Y105Q, N136D, N170L) survived at 50 µg/ml cefotaxime.

Example 2

Secondary Library Generation of a Xylanase

PDA™ Technology Screening Leads to Enormous Reduction in Number of Possible Sequences

To demonstrate that computational screening is feasible and will lead to a significant reduction in the
number of sequences that have to be experimentally screened, calculations for the B. circulans
xylanase with and without the substrate were performed. The PDB structures 1XNB of free B.
circulans xylanase and 1BCX for the enzyme substrate complex were used. 27 residues inside the
binding site were visually identified as belonging to the active site. 8 of these residues were regarded
as absolutely essential for the enzymatic activity. These positions were floated (see Figure 2). This
means that they could change their side chain conformation but not their amino acid identity.

Three of the 20 naturally occurring amino acids were not considered (cysteine, proline, and glycine).
Therefore, 17 different amino acids were still possible at the remaining 19 positions; the problem
yields 17^19 = 2.4 x 10^23 different amino acid sequences. This number is 10 orders of magnitude larger
than what may be handled by state of the art directed evolution methods. Clearly these approaches
cannot be used to screen the complete dimensionality of the problem and consider all sequences with
multiple substitutions. Therefore PDA™ technology calculations were performed to reduce the
sequence space. Starting from the PDA™ technology ground state, a list of 10,000 low energy
sequences was created by Monte Carlo and the probability for each amino acid at each position was
determined (see Table 4).



Table 4: Probability of amino acids at the designed positions resulting from the PDA™ technology
calculation of the wild type (WT) enzyme structure. Only amino acids with a probability greater than
1% are shown.



WT 


PDA™ Technology Probability Distribution


5Y 


Wl.2% 


F5.8% 


Y2.9% 


H4.0% 








7Q 


E9.1% 


LO.2% 












1.1 D 


11.2% 


DO.7% 


VO.1% 


M7.9% 


L6.4% 


E5.3% 


T4.2% 






Q3.8% 


Y2.6% 


F2.1% 


Ni.g% 


81.9% 


A1.1% 


37 V 


D9.9% 


M9.4% 


VI .4% 


S2.8% 


14.1% 


E1.0% 




39 G 


A9.8% 














63 N 


Wl.2% 


Q6.7% 


A1 .4% 










65 Y 


El, 7% 


L4.9% 


M3.4% 










67T 


E1.0% 


D2.3% 


L3.9% 


A1.7% 








71 W 


V7.8% 


F5.5% 


W8,5% 


M6.0% 


05.8% 


E4.3% 


11.0% 


80 Y 


M2.4% 


LI. 5% 


F9,0% 


15.9% 


Y5.7% 


E3.7% 




82 V 


V8.6% 


D1.0% 
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88 Y 


N1.1% 


K6.6% 


W1.3% 


HOT 


D9.9% 




115A 


A5.6% 


Y7.8% 


T4.4% 


118E 


E22% 


D2.6% 


. 12.0% 


125 F 


F9.4% 


Y1.8% 


M7.3% 


129 W 


E1.3% 


S8.6% 




168 V 


D8.1% 


A1.0% 




170 A 


A8.7% 


S7.6% 


D3.7% 



D10.2% 
A1.7% 
LI. 5% 



S9.2% 



F2.6% 



If we consider all the amino acids obtained from the PDA™ technology calculation, including those
with probabilities less than 1%, we obtain 4.1 x 10^^ different amino acid sequences. This is a
reduction by 7 orders of magnitude. If one only considers those amino acids that have a
probability of more than 1%, as shown in Table 4 (1% criterion), the problem is decreased to 6.6 x 10^
sequences. If one neglects all amino acids with a probability of less than 5% (5% criterion) there are
only 4.0 x 10^ sequences left. This is a number that may be easily handled by screening and gene
shuffling techniques. Increasing the list of low energy sequences to 100,000 does not change these
numbers significantly and the effect on the amino acids obtained at each position is negligible.
Changes occur only among the amino acids with a probability of less than 1%. Including the
substrate in the PDA™ technology calculation further reduced the number of amino acids found at
each position. If we consider those amino acids with a probability higher than 5%, we obtain 2.4 x 10®
sequences (see Table 5).
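
The effect of a probability criterion on library size can be illustrated with a short sketch: the library size is the product, over the designed positions, of the number of amino acids whose probability exceeds the cutoff. The function name and the three-position toy distribution below are hypothetical and are not values from Table 4 or Table 5; only the size of the unscreened space (17 amino acids at 19 positions) comes from the text.

```python
import math

def library_size(distribution, cutoff=0.0):
    """Product over positions of the number of amino acids with probability > cutoff (%)."""
    return math.prod(
        sum(1 for p in probs.values() if p > cutoff)
        for probs in distribution.values()
    )

# Hypothetical per-position probabilities in percent (illustration only).
toy_distribution = {
    1: {"A": 60.0, "S": 30.0, "T": 6.0, "V": 3.5, "M": 0.5},
    2: {"L": 90.0, "I": 8.0, "M": 2.0},
    3: {"D": 99.5, "E": 0.5},
}

all_residues = library_size(toy_distribution)          # every residue observed
one_percent = library_size(toy_distribution, 1.0)      # 1% criterion
five_percent = library_size(toy_distribution, 5.0)     # 5% criterion
unscreened = 17 ** 19                                  # 17 amino acids at 19 positions
print(all_residues, one_percent, five_percent, f"{unscreened:.1e}")
```

Note that a position at which no residue passes the cutoff would give a library size of zero in this sketch; in practice at least one residue would be retained at every position.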



Table 5: Probability of amino acids at the designed positions resulting from the PDA™
technology calculation of the enzyme substrate complex. Only those amino acids with a
probability greater than 1% are shown.



WT 



5Y 

7Q 

11D 

37V 

39G 

63N 

65Y 

67T 

71W 

SOY 

82V 

88Y 



PDA™ TECHNOLOGY PROBABILITY DISTRIBUTION 



Y69.2% W17.0% H 7.3% 
Q78.1% E18.0% 



D97.1% 

V50.9% 

S80.6% 

W92.2% 

E91.1% 

E92.6% 

VV62.6% 

M66.4% 

V86.0% 



D33.9% 
A19.4% 
D 3.9% 
L 8.7% 
L 5.2% 
El 3.3% 
F13.6% 
D12.8% 
Y16.9% 



L 3.9% 

S 5.4% 
Q 2.9% 



Ml 1.0% 
El 0.7% 



F 6.0% 



A 1.2% 



S6.9% 
I 6.0% 



L1.0% 



D4.0% 
LI .3% 



N11.4% F9.6% K1.9% Q1.4% D1.4% Ml. 4% 

designed sequences. Wang Y, Zhang H, Scott RA. A new computational model for protein folding
based on atomic solvation. Protein Sci. 1995 Jul; 4(7): 1402-11; Kuhlman B, Baker D. Native protein
sequences are close to optimal for their structures. Proc Natl Acad Sci U S A. 2000 Sep 12; 97(19):
10383-8; Street AG, Datta D, Gordon DB, Mayo SL. Designing protein beta-sheet surfaces by
Z-score optimization. Phys Rev Lett. 2000 May 22; 84(21): 5010-3; Gordon DB, Marshall SA, Mayo
SL. Energy functions for protein design. Curr Opin Struct Biol. 1999 Aug; 9(4): 509-13. Review.



In a preferred embodiment, one or more scoring functions are optimized or "trained" during the
computational analysis, and then the analysis is re-run using the optimized system. For example, the
results of PDA™ technology calculations, described below, performed on decoy structures, may be
used to obtain optimal sets of scoring function weights. First, the various components of a force field
are factorized within the computer algorithm. A starting set of parameters, or weights, is defined
based on best guess or previous knowledge of the parameter space. The current parameter set is
used in conjunction with a protein design algorithm to design one or more protein sequences and
structures. These generated structures are then treated as decoy structures. The optimal set of
parameters is considered to be that which predicts that decoys with properties very different from the
reference structure (native structure or prototype structure) are high in energy. In a preferred
embodiment of the invention, a set of equations, relating the calculated energies of each decoy and
comparison of each of its energy components to the reference, is used to iteratively optimize the
parameters.



In a preferred embodiment of the invention, the parameterization simulation begins with the creation
of a number of decoy structures (e.g. 100-200) using random scoring function weights (within a
predefined range), and a computational protein design algorithm.



In a preferred embodiment, parameters are modified at each iteration cycle according to the following
equation:

\[ \mathrm{mod}_{i,d} = \lambda \,(E_{i,d} - E_{i,n})\, P_d \]



which details the modification of weight i based on evaluation of decoy d. E_{i,d} and E_{i,n} represent the 
values of the ith scoring function component for the decoy and reference structures. P_d represents 
the normalized Boltzmann probability of the decoy structure according to its total energy using the 
current weights. The equation may be interpreted as follows. If a decoy structure's ith energy 
component is higher in value than that of the native (E_{i,d} vs. E_{i,n}), the weight of the ith parameter is 
increased to an extent related to the difference in energy components. The extent of increase is 
further related to the current probability (P_d) of the decoy structure: only high probability (low energy) 
decoy structures contribute. The change in parameter will thus by definition lead to an increase in the 
calculated energy of the decoy relative to the reference (higher energy or score = less favorable). 
The value of λ determines the rate at which the parameters are varied. The equation is applied to 
each decoy in the set. Because the probabilities of the decoys are dynamically related to the change 
in parameters, multiple iterations over the current decoy set (see below) are applied: 

$$ \mathrm{mod}_{i,d} = \lambda \, (E_{i,d} - E_{i,n}) \, P_d $$
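
The weight update described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the invention. It assumes the update is the product of the rate λ, the component difference between decoy and reference, and the normalized Boltzmann probability of the decoy. The names `boltzmann_probabilities`, `update_weights`, `lam`, and `kT` are hypothetical, and the energy components are taken to be the unweighted values.

```python
import math

def boltzmann_probabilities(energies, kT=1.0):
    """Normalized Boltzmann probabilities; low-energy decoys get high probability."""
    e_min = min(energies)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]

def update_weights(weights, decoy_components, ref_components, lam=0.1, kT=1.0):
    """Apply one pass of lam * (E_id - E_in) * P_d over every decoy d and weight i.

    decoy_components[d][i] is the ith energy component of decoy d,
    ref_components[i] the same component for the reference structure.
    """
    # Total decoy energies under the current weights decide the probabilities.
    totals = [sum(w * e for w, e in zip(weights, comps)) for comps in decoy_components]
    probs = boltzmann_probabilities(totals, kT)
    new_weights = list(weights)
    for comps, p_d in zip(decoy_components, probs):
        for i, (e_id, e_in) in enumerate(zip(comps, ref_components)):
            new_weights[i] += lam * (e_id - e_in) * p_d
    return new_weights
```

Because the probabilities depend on the weights, the function would be called repeatedly on the same decoy set, recomputing `probs` on each pass.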

In a preferred embodiment, once the parameterization cycle begins, new decoys are generated with 
the modified parameters. This ensures broad enough coverage of parameter space by creating a 
broad range of decoy structures, and leads to the creation of ever-increasing competition between 
decoys and reference. However, because the actively created decoys will, at some point in the cycle, 
become high quality structures, an additional term may be added to the parameter modification 
scheme: 

In a preferred embodiment, the parameterization is performed independently on a number of protein 
target structures. Parameterization using a number of small protein structures has revealed, 
importantly, that optimal parameters derived from one protein correlate strongly with those derived 
from different proteins. This result indicates that the invention yields parameter sets that are 
applicable to a wide variety of proteins. In an additional aspect, the parameter optimization method is 
applied separately to sets of proteins that exhibit a common desired property (e.g. high solubility, 
thermostability). In this manner, force field parameters may be specifically trained to design proteins 
with desired properties, such as thermostability, solubility, and the like. 

A key discovery associated with the method is that natural proteins appear to have highly conserved 
relative amounts of various energy components (polar group contacts, hydrogen bonding energy, 
etc.). Thus, although the invention is conveniently applied using the native structure as a reference 
state, analysis of multiple natural proteins has indicated that a prototypical protein may readily be 
defined and utilized as a reference state. 

Finally, a diversity of related scoring function weights may be applied in separate applications of a 
protein design cycle, such that a diversity of sequence solutions is derived. 

In a preferred embodiment, the scoring functions outlined above may be biased or weighted in a 
variety of ways that do not involve "training". For example, a bias towards or away from a reference 
sequence or family of sequences may be incorporated; for example, a bias towards wild-type or 
homologue residues may be used. Similarly, the entire protein or a fragment thereof may be biased; 
for example, the active site may be biased towards wild-type residues. A bias towards or against 
increased energy may be generated. Furthermore, biases may be used to design in selectivity. For 
example, a bias against sequences that bind to one or more unwanted substrates or receptors may 
be used. Additional scoring function biases include, but are not limited to, applying electrostatic 
potential gradients or hydrophobicity gradients, and biasing towards a desired charge, isoelectric 
point, or hydrophobicity. In addition, experimental data, which may include values for any protein 
property or properties, may be used to generate biases or weights. 

Once the scoring functions to be used are identified for each variable position, the preferred first step 
in the computational analysis is the determination of the interaction of each possible rotamer with all 
or part of the remainder of the protein. That is, the energy of interaction, as measured by one or more 
of the scoring functions, of each possible rotamer at each variable position (or each variable and 
floated position) with the backbone and/or other rotamers, is calculated. In a preferred embodiment, 
the interaction energy of each rotamer with the entire remainder of the protein, i.e. both the entire 
template and all other rotamers, is calculated. However, as outlined above, it is also possible to 
model only a portion of a protein, for example, a domain, motif, or site in a larger protein. 

In a preferred embodiment, two sets of interaction energies are calculated for each side chain rotamer 
at every position: the interaction energy between the rotamer and the template or backbone (the 
"singles" energy), and the interaction energy between the rotamer and all other possible rotamers at 
every other position (the "doubles" energy), whether that position is varied or floated. It should be 
understood that the template in this case includes the atoms of the protein structure backbone, 
the atoms of any fixed residues, and any non-protein atoms in the scaffold. In an 
alternate embodiment, singles and doubles energies are calculated for fixed positions as well as for 
variable and floated positions. 

Some energy terms, such as the secondary structure propensity scoring function, may be a 
component of the singles energy only. As will be appreciated by those in the art, many of the doubles 
energy terms will be close to zero, as many of the energy terms depend on the physical distance 
between the first rotamer and the second rotamer. That is, the farther apart the two moieties, the 
lower the magnitude of the interaction energy typically will be. Furthermore, energy terms are not 
typically calculated for atoms that are separated by fewer than three, or alternatively fewer than four, 
covalent bonds. 

Once the singles and doubles energies are calculated and stored, the next step of the computational 
processing may occur: the identification of one or more sequences that have a low energy or 
favorable score. Alternatively, energies may be calculated as needed during the optimization steps, 
although this is often less computationally efficient. 
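
Once singles and doubles energies are stored, the energy of any rotamer sequence is a sum of table lookups. The sketch below illustrates this under assumed data layouts that are not taken from the source: `singles[i][r]` holds the rotamer-template energy of rotamer `r` at position `i`, and `doubles[(i, j)][r][s]` holds the rotamer-rotamer energy for positions `i < j`. The function name `total_energy` is hypothetical.

```python
def total_energy(assignment, singles, doubles):
    """Energy of one rotamer sequence from precomputed singles/doubles tables.

    assignment[i] is the rotamer index chosen at position i.
    """
    n = len(assignment)
    # Singles: each chosen rotamer against the template.
    energy = sum(singles[i][assignment[i]] for i in range(n))
    # Doubles: each pair of chosen rotamers, counted once.
    for i in range(n):
        for j in range(i + 1, n):
            energy += doubles[(i, j)][assignment[i]][assignment[j]]
    return energy
```

Storing the tables once means that each candidate sequence costs only on the order of n² lookups, which is why precomputation is usually preferred over recalculating energies during optimization.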

COMBINATORIAL OPTIMIZATION ALGORITHMS 

The discrete nature of rotamer sets allows a simple calculation of the number of possible rotameric 
sequences for a given design problem. A backbone of length n with m possible rotamers per position 
will have m^n possible rotamer sequences, a number which grows exponentially with sequence length. 
For very simple design calculations, it is possible to examine each possible sequence in order to 
identify the optimal sequence and/or one or more favorable sequences. However, for a typical design 
problem, the number of possible sequences (up to 10^80 or more) is sufficiently large that examination 
of each possible sequence is intractable. A variety of combinatorial optimization algorithms may then 
be used to identify the optimum sequence and/or one or more favorable sequences. 
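
The growth of the search space can be made concrete with a short calculation. The helper names below (`count_sequences`, `enumerate_sequences`) are hypothetical, and the sketch allows a different number of rotamers at each position, which reduces to m^n when every position has m rotamers.

```python
from itertools import product

def count_sequences(rotamers_per_position):
    """Number of rotamer sequences: the product of rotamer counts over positions."""
    count = 1
    for m in rotamers_per_position:
        count *= m
    return count

def enumerate_sequences(rotamers_per_position):
    """Lazily yield every rotamer sequence; practical only for very small problems."""
    return product(*(range(m) for m in rotamers_per_position))
```

For example, 4 positions with 3 rotamers each give 81 sequences that can be enumerated directly, whereas 80 positions with 10 rotamers each already give 10^80.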

Combinatorial optimization algorithms may be divided into two classes: (1) those that are guaranteed 
to return the global minimum energy configuration if they converge, and (2) those that are not 
guaranteed to return the global minimum energy configuration, but which will always return a solution. 
Examples of the first class of algorithms include, but are not limited to, Dead-End Elimination (DEE) 
and Branch & Bound (B&B) (including Branch and Terminate) (see Gordon and Mayo, Structure Fold. 
Des. 7:1089-98, 1999), and examples of the second class of algorithms include, but are not limited to, 
Monte Carlo (MC), self-consistent mean field (SCMF), Boltzmann sampling, simulated annealing, 
genetic algorithm (GA) and Fast and Accurate Side-Chain Topology and Energy Refinement 
(FASTER). 

Combinatorial optimization algorithms may be used alone or in conjunction with each other. 
Strategies for applying combinatorial optimization algorithms to protein design problems include, but 
are not limited to, (1) finding the global minimum energy configuration, (2) finding one or more low-energy 
or favorable sequences, and, most preferred, (3) finding the global minimum energy configuration and 
then finding one or more low-energy or favorable sequences. For example, as outlined in U.S.S.N. 
09/127,926 and PCT US98/07254, preferred embodiments utilize a Dead End Elimination (DEE) step, 
and preferably a Monte Carlo step. 

In a preferred embodiment when scoring is used, the primary library comprises the optimum 
sequence. That is, computational processing is run until the simulation program converges on a 
single sequence which is the global optimum. In a preferred embodiment, the primary library 
comprises at least two optimized protein sequences. Thus, for example, the computational processing 
step may eliminate a number of disfavored sequences but be stopped prior to convergence, providing 
a library of sequences of which the global optimum is one. In addition, further computational analysis, 
for example using a different method, may be run on the library, to further eliminate sequences or 
rank them differently. Alternatively, as is more fully described in U.S. Patent Nos. 6,269,312, 
6,188,965, and 6,403,312, which are herein expressly incorporated by reference, the global optimum 
may be reached, and then further computational processing may occur, which generates additional 
optimized sequences. 

It should be noted that the preferred methods of the invention result in a rank-ordered list or filtered 
set of sequences; that is, the sequences are ranked or filtered on the basis of some objective criteria. 
However, as outlined herein, it is possible to create a set of non-ordered sequences, for example by 
generating a probability table directly that lists sequences without ranking them. 

In a preferred embodiment, an algorithm that is guaranteed to return the global minimum free energy 
configuration (GMEC) is used. However, such algorithms are not guaranteed to converge to a 
solution in a tractable amount of time. That is, the algorithm may get stuck. As a result, alternate 
strategies may be required for some design problems. 

The DEE calculation is based on the assumption that if the worst total interaction of a first rotamer is 
still better than the best total interaction of a second rotamer, then the second rotamer cannot be part 
of the global minimum energy configuration. An additional aspect of DEE states that if the energy of a 
rotamer sequence can always be lowered by changing from a first rotamer to a second rotamer, the 
first rotamer cannot be part of the global minimum. Since the energies of all rotamers have already 
been calculated, the DEE approach only requires sums over the sequence length to test and 
eliminate rotamers, which speeds up the calculations considerably. DEE may also include steps in 
which pairs of rotamers, or combinations of rotamers, are compared in order to identify sets of 
rotamers that are not compatible with the global minimum free energy configuration. In order to use 
DEE, the energy or scoring function must be pairwise-decomposable. That is, the energies or scores 
must be a function of the conformation and/or identity of at most two rotamers. 
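
The first elimination criterion described above can be sketched as follows. This is an illustrative sweep of the basic singles criterion only, not the full DEE procedure with pairs or combinations of rotamers. It assumes the hypothetical table layout `singles[i][r]` and `doubles[(i, j)][r][s]` for `i < j`, with `alive[i]` a set of rotamer indices still under consideration at position `i`.

```python
def dee_pass(singles, doubles, alive):
    """One sweep of the basic dead-end elimination test.

    Rotamer r at position i is removed when its best possible total interaction
    is still worse (higher) than the worst possible total interaction of some
    other rotamer t at the same position. Returns True if anything was removed.
    """
    def pair(i, r, j, s):
        # Look up the pair energy regardless of which position index is smaller.
        return doubles[(i, j)][r][s] if i < j else doubles[(j, i)][s][r]

    n = len(singles)
    eliminated = False
    for i in range(n):
        def bound(r, pick):
            # pick=min gives the best case for r, pick=max the worst case.
            return singles[i][r] + sum(
                pick(pair(i, r, j, s) for s in alive[j])
                for j in range(n) if j != i)

        for r in sorted(alive[i]):
            if len(alive[i]) == 1:
                break  # never remove the last rotamer at a position
            best_r = bound(r, min)
            if any(best_r > bound(t, max) for t in alive[i] if t != r):
                alive[i].discard(r)
                eliminated = True
    return eliminated
```

In use, the sweep would be repeated (for example `while dee_pass(singles, doubles, alive): pass`) because each elimination tightens the bounds for the remaining rotamers.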

In the B&B or A* algorithm, a tree is built, where a rotamer is first picked for one position, then a 
second position, and so on until one complete rotameric sequence is generated. The energy for that 
rotameric sequence is then calculated or obtained from the results of an earlier energy calculation. 
The process is then repeated, adding additional branches to the tree. However, in subsequent steps, 
if at any point the energy of the partially constructed rotameric sequence is worse than the energy of a 
previously identified complete rotameric sequence, all sequences that contain that partial rotameric 
sequence may be eliminated. The process may be continued until the GMEC is identified. 
Additionally, B&B may be used to generate a list of all sequences that are within some energy or 
score of the GMEC (Gordon and Mayo, Structure Fold. Des. 7:1089-98, 1999) (Leach and Lemon, 
Proteins 33(2): 227-239, 1998). As for all the techniques listed herein, these algorithms can be used 
to generate a primary library or a secondary computational library. 
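
A depth-first version of the tree search described above might look like the sketch below. It is an illustration under stated assumptions rather than the cited algorithms: the tables use the hypothetical layout `singles[i][r]` and `doubles[(i, j)][r][s]` for `i < j`, and pruning uses a simple lower bound on the energy still to be added, since comparing the bare partial energy with a complete sequence is only safe when the remaining terms cannot be negative.

```python
def branch_and_bound(singles, doubles):
    """Return (assignment, energy) of the lowest-energy rotamer sequence."""
    n = len(singles)
    best = {"energy": float("inf"), "assignment": None}

    def lower_bound(partial, energy):
        # Optimistic completion: each unassigned position takes its cheapest
        # rotamer given the assigned ones and the best case for later positions.
        k = len(partial)
        bound = energy
        for j in range(k, n):
            bound += min(
                singles[j][s]
                + sum(doubles[(i, j)][partial[i]][s] for i in range(k))
                + sum(min(doubles[(j, m)][s]) for m in range(j + 1, n))
                for s in range(len(singles[j])))
        return bound

    def extend(partial, energy):
        if len(partial) == n:
            if energy < best["energy"]:
                best["energy"], best["assignment"] = energy, list(partial)
            return
        if lower_bound(partial, energy) >= best["energy"]:
            return  # no completion of this branch can beat the best so far
        j = len(partial)
        for s in range(len(singles[j])):
            added = singles[j][s] + sum(
                doubles[(i, j)][partial[i]][s] for i in range(j))
            extend(partial + [s], energy + added)

    extend([], 0.0)
    return best["assignment"], best["energy"]
```

Relaxing the pruning test to keep branches within a chosen energy window of the best sequence would give the list of near-optimal sequences mentioned in the text.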

In an alternate embodiment, combinatorial search algorithms that are not guaranteed to return the 
GMEC may be used, either alone or following identification of the GMEC. These algorithms may also 
be referred to as sampling techniques. Algorithms that do not return the GMEC are typically 
computationally efficient and converge to a solution or solutions in a tractable, predictable amount of 
time. However, the quality of the solutions returned using these algorithms is variable, and may 
sometimes be insufficient. These sampling methods may include the use of amino acid substitutions, 
insertions or deletions, or recombinations of one or more sequences. 

Sampling techniques use a variety of approaches to jump between different points in sequence space 
(that is, between different possible variant sequences). For all sampling techniques, the kinds of 
allowable jumps may be altered (for example, jumps to random residues, jumps biased away from the 
wild type sequence, jumps biased towards similar residues, jumps where multiple residue positions 
are simultaneously changed, etc.). After jumping to a new sequence, the algorithm will choose 
whether to accept or reject the jump. The acceptance criteria for each sampling jump may be altered 
by modifying the temperature factor. As will be appreciated by those skilled in the art, high 
temperature factors allow searches across a broad area of sequence space, and low temperature 
factors allow searches over a narrow region of sequence space. See Metropolis et al., J. Chem. Phys. 
21:1087, 1953, hereby expressly incorporated by reference. 

A preferred embodiment utilizes a Monte Carlo search, which is a series of biased, systematic, or 
random jumps in sequence space. Monte Carlo searching may be used to explore sequence space 
around the global minimum, to find new local minima distant in sequence space, or to find one or 
more low energy sequences. A Monte Carlo search may be performed to generate a rank-ordered list 
or filtered set of sequences in the neighborhood of the GMEC. Starting at the GMEC, random 
positions are changed to other residues or rotamers (that is, the conformation and/or identity is 
changed), and the energy of the new sequence is calculated. If the new sequence meets the criteria 
for acceptance, it is used as a starting point for another jump. After a predetermined number of 
jumps, a rank-ordered list or filtered set of sequences is generated. Monte Carlo searches may also 
be started at sequences that are not the GMEC, including randomly selected sequences. Such 
searches may be used to generate a list of favorable sequences when the GMEC is not known. 
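
The jump, accept, and rank cycle described above can be sketched with the Metropolis acceptance rule. The function below is a generic illustration, not the search used in the invention. The names `monte_carlo`, `energy_fn`, `choices`, and `temperature` are hypothetical, `energy_fn` stands in for whatever scoring function is in use, and single-position random jumps are assumed.

```python
import math
import random

def monte_carlo(energy_fn, start, choices, steps=1000, temperature=1.0, seed=0):
    """Metropolis search; returns visited sequences ranked by energy (lowest first).

    choices[i] lists the residues or rotamers allowed at position i.
    """
    rng = random.Random(seed)
    current, e_current = list(start), energy_fn(start)
    seen = {tuple(current): e_current}
    for _ in range(steps):
        # Jump: change one randomly chosen position.
        pos = rng.randrange(len(current))
        trial = list(current)
        trial[pos] = rng.choice(choices[pos])
        e_trial = energy_fn(trial)
        delta = e_trial - e_current
        # Accept downhill jumps always, uphill jumps with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_trial
            seen[tuple(current)] = e_current
    return sorted(seen.items(), key=lambda item: item[1])
```

Raising `temperature` accepts more uphill jumps and broadens the search, while lowering it confines the search to the neighborhood of the starting sequence.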

In another embodiment, self-consistent mean field (SCMF) methods (see Delarue et al., Pac. Symp. 
Biocomput. 109-21 (1997); Koehl et al., J. Mol. Biol. 239:249 (1994); Koehl et al., Nat. Struct. Biol. 
2:163 (1995); Koehl et al., Curr. Opin. Struct. Biol. 6:222 (1996); Koehl et al., J. Mol. Biol. 293:1183 
(1999); Koehl et al., J. Mol. Biol. 293:1161 (1999); Lee, J. Mol. Biol. 236:918 (1994); and Vasquez, 
Biopolymers 36:53-70 (1995); all of which are expressly incorporated by reference) are used. SCMF 
works by determining the optimal set of probabilities for all rotamer and residue states in the 
simulation, using a self-consistency criterion that relates the mean-field energies of the states to their 
probabilities, and vice versa. The final probabilities may be used to define a list of favorable 
sequence combinations that define a combinatorial library of protein sequences. As for all the 
techniques listed herein, SCMF can be used to generate a primary library or a secondary 
computational library. 

In a preferred embodiment, the sampling technique utilizes genetic algorithms, e.g., such as those 
described by Holland (Adaptation in Natural and Artificial Systems, 1975, Ann Arbor, U. Michigan 
Press). Genetic algorithm analysis generally takes generated sequences and recombines them 
computationally, similar to a nucleic acid recombination event, in a manner similar to gene shuffling. 
Thus the "jumps" of genetic algorithm analysis generally are multiple position jumps. In addition, as 
outlined below, correlated multiple jumps may also be done. Such jumps may occur with different 
crossover positions and more than one recombination at a time, and may involve recombination of 
two or more sequences. Furthermore, deletions or insertions (random or biased) may be done. In 
addition, as outlined below, genetic algorithm analysis may also be used after the secondary library 
has been generated. 
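
The recombination jump of a genetic algorithm can be illustrated with a simple multi-point crossover of two equal-length sequences. This is a generic sketch. The function name `crossover` and its parameters are hypothetical, and selection, mutation, insertions, and deletions are left out.

```python
import random

def crossover(parent_a, parent_b, n_points=1, rng=None):
    """Recombine two equal-length sequences at n_points random crossover positions."""
    rng = rng or random.Random()
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # Crossover points lie strictly inside the sequence.
    points = sorted(rng.sample(range(1, len(parent_a)), n_points))
    parents = (parent_a, parent_b)
    child, source, previous = [], 0, 0
    for point in points + [len(parent_a)]:
        child.extend(parents[source][previous:point])
        source, previous = 1 - source, point  # switch parent at each point
    return "".join(child)
```

Each call changes several positions at once, which is the multiple-position jump the text refers to.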

In a preferred embodiment, Boltzmann sampling is done. As will be appreciated by those in the art, 
the temperature factor criteria for Boltzmann sampling may be altered to allow broad searches at high 
temperature factors and narrow searches close to local optima at low temperature factors (see e.g., 
Metropolis et al., J. Chem. Phys. 21:1087, 1953). 

In a preferred embodiment, the sampling technique utilizes simulated annealing, e.g., such as 
described by Kirkpatrick et al. (Science, 220:671-680, 1983). Simulated annealing alters the cutoff for 
accepting good or bad jumps by altering the temperature factor in a systematic manner. That is, 
slowly decreasing the temperature factor will slowly increase the stringency of the cutoff. This allows 
broad searches at high temperature factors to new areas of sequence space and narrow searches at 
low temperature factors to explore regions in detail. 

In a preferred embodiment, the FASTER method is used for determination of global optimization of 
the side chain conformations of proteins. The FASTER method focuses on resolving the combinatorial 
side chain packing problem by converging on near-optimal minima (see Desmet et al., Proteins, 
48:31-43, 2002). 

TABOO 

In a preferred embodiment, a diverse set of low-energy sequences is obtained using a class of 
algorithms referred to as tabu search algorithms. Traditionally, tabu search algorithms have been 
used to search for alternative local minima. The present invention presents a novel use of tabu 
search algorithms by using these algorithms to map amino acid sequence subspaces (see Modern 
Heuristic Search Methods, edited by V.J. Rayward-Smith, et al., 1996, John Wiley & Sons Ltd., 
hereby expressly incorporated by reference in its entirety). When used to map sequence space, the 
tabu search algorithms are referred to herein as "Taboo" search algorithms. A Taboo search 
assumes that alternative optimization methods, such as protein design algorithms incorporating Dead 
End Elimination, genetic algorithms, or Monte Carlo searches, have been used to provide the location of 
the global minimum or a local minimum. Thus, a Taboo search is used for finding other regions or 
subspaces of the search space that contain local minima, preferably those that are reasonably low in 
energy compared to the global minimum. 

Taboo searches are capable of identifying alternative low energy basins because the search 
avoids local optima by recording previously seen solutions: it keeps a list of moves 
which have been made in the recent past of the search and which are tabu, or forbidden, for a certain 
number of iterations. That is, if a move in the search space has been made recently, that move is 
discouraged for some duration of time during the sampling procedure. The moves may be forbidden 
for some period of time or search (which can be varied), or weighted against but not forbidden. Such 
a mechanism helps to avoid cycling and serves to promote the identification of alternative low energy 
basins. This concept is illustrated in Figure 2. For example, by making the low energy basin 
identified by PDA™ technology taboo, the search is forced to discover a different low energy basin. 
This cycle may be repeated until most or all of the alternative low energy basins are identified. These 
alternative low energy basins or subspaces represent regions of the sequence space of a protein that 
should be explored experimentally by creation of secondary libraries (see Figure 3). 

Once a starting sequence is chosen, a taboo search is done. Preferably, the taboo search is done by 
applying one or more pseudo energies (pE) and serves to temporarily change the perceived energy 
landscape of the sequence space (see Figure 4). For example, if a single protein design simulation 
converges at iteration k to a variable protein sequence and structure that contains amino acid aa in 
rotamer state r at position i, then the matrix of side chain-template energies will be modified at 
iteration k+1 as follows: 

$$ pE^{k+1}_{aa,r,i} = pE^{k}_{aa,r,i} + \delta E_{taboo} $$

where the pseudo energy at the very first iteration is equivalent to the calculated energy: 

$$ pE_{aa,r,i} = E_{aa,r,i} \quad \text{(first iteration)} $$

and δE_taboo is defined by the simulation parameters. In some embodiments, the δE_taboo magnitude is 
dynamic (e.g., random and/or slowly decreasing), again as defined by simulation parameters. E 
represents the energy calculated by the force field or scoring function (i.e., Ee* in the Figures). 



Application of the pseudo energy increase discourages repeated convergence to solutions containing 
amino acid aa in rotamer state r at position i in subsequent design simulations. However, if the 
current value of pE is insufficient to discourage incorporation of the rotamer, convergence to the same 
rotamer state in a subsequent simulation will result in an additional increase of the pseudo energy. 

In a preferred embodiment, calculated and pseudo energies are stored in separate memory locations 
so that the calculated energy of any solution may be reported directly. This aspect is important for 
separating the effects of the taboo search from an accurate assessment of protein sequence 
energies. In a preferred embodiment, the pseudo energy increase is applied to only one rotamer state 
of a converged aa at position i. In a preferred embodiment, the pseudo energy increase is applied to 
all rotamer states of a converged aa at position i. In a preferred embodiment, the pseudo energy 
increase is applied to a plurality of amino acid positions, and/or a plurality of rotamer states. Thus, a 
taboo search results in the identification of alternate amino acids/rotamer states for at least one and 
preferably more than one amino acid position. Alternate amino acids/rotamer states may be reused in 
a protein design cycle to generate alternate variable protein sequences. 
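
The pseudo energy bookkeeping described above can be sketched as a loop around a design calculation. The sketch is deliberately simplified and rests on assumptions not found in the source: only singles energies are used, the design step is passed in as a hypothetical function `design_fn`, the increment `delta_e` is fixed, and the stand-in `pick_lowest` simply takes the lowest pseudo energy at each position. Calculated energies are kept apart from pseudo energies so that reported energies are unaffected by the taboo increments.

```python
def taboo_search(singles, design_fn, cycles=3, delta_e=2.0):
    """Repeat a design calculation, penalizing rotamers found in earlier solutions.

    singles[i][r] holds calculated energies and is never modified; pseudo
    energies live in a separate copy that starts equal to the calculated values.
    """
    pseudo = [list(row) for row in singles]
    solutions = []
    for _ in range(cycles):
        assignment = design_fn(pseudo)
        # Report the calculated energy, not the pseudo energy.
        true_energy = sum(singles[i][r] for i, r in enumerate(assignment))
        solutions.append((tuple(assignment), true_energy))
        # Make the rotamers of this solution less attractive next cycle.
        for i, r in enumerate(assignment):
            pseudo[i][r] += delta_e
    return solutions

def pick_lowest(pseudo):
    """Toy design step: lowest pseudo energy rotamer at each position."""
    return [min(range(len(row)), key=row.__getitem__) for row in pseudo]
```

With `singles = [[0.0, 1.0], [0.0, 3.0]]` and three cycles, the loop visits three different solutions in turn instead of returning the same minimum each time.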

In a preferred embodiment, the taboo search is done by applying a probability parameter to at least 
one amino acid position. Such biased sampling is generally applied within so-called heuristic or 
stochastic sampling methods such as Monte Carlo or genetic algorithms. For example, many 
heuristic methods use the concept of Boltzmann sampling, which is based on the energy difference 
between two states. In a preferred embodiment, a probability parameter results in a modification of 
the Boltzmann probability (Pb) such that the sampling probability (Ps) is reduced: 

$$ P_b = e^{-\Delta E / RT} $$

$$ P_s = e^{-(\Delta E + \delta E_{taboo}) / RT} $$

In a preferred embodiment any or all of the methods described herein may utilize a recency 
parameter. In other words, application of a recency parameter ensures that the most recent moves in 
sequence space are prohibited for a certain number of iterations. Moves that are considered to be 
prohibited are derived from a running list, which is an ordered list of all moves performed throughout 
the search. If the length of the running list is limited, recency may be viewed as the equivalent of 
short term memory. As will be appreciated by those of skill in the art, one consequence of limiting the 
length of the running list is that the prohibited moves may be encouraged at a later point in the 
simulation to allow for the exploration of a sequence space that has not been visited for some defined 
duration. Thus, recency may be a fixed parameter or allowed to vary dynamically during the search. 

In a preferred embodiment, recency is applied to the modified energy matrix by continual application 
of a damping term to all pseudo energies as follows: 

$$ pE^{k+1}_{aa,r,i} = \lambda \cdot pE^{k}_{aa,r,i} $$

This continual damping has the simple effect of an exponential decay of the pseudo energy over 
multiple simulation cycles. 

In a preferred embodiment, the damping is applied at every simulation cycle before or after 
application of additional pseudo energy increases. As will be appreciated by those of skill in the art, 
this approach mathematically enforces a ceiling or upper limit on the magnitude of the pseudo energy, 
defined by the combination of λ and δE. 
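
The ceiling can be made explicit with a short derivation. It assumes, beyond what the text states, that the damping factor λ satisfies 0 < λ < 1, that the increment δE is constant, that damping is applied before the increment, and that the same rotamer is penalized in every cycle (the worst case):

$$ pE^{k+1} = \lambda \, pE^{k} + \delta E \quad\Longrightarrow\quad pE^{*} = \lambda \, pE^{*} + \delta E \quad\Longrightarrow\quad pE^{*} = \frac{\delta E}{1 - \lambda} $$

Under these assumptions, λ = 0.9 and δE = 1 give a limiting pseudo energy of 10 in the units of δE. If the damping is instead applied after the increment, the same argument gives a limit of λδE/(1 − λ), or 9 in this example.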

In a preferred embodiment, the application of the damping and pseudo energy terms is reversed. 
In an alternative embodiment, recency is applied to the modified energy matrix by continual 
application of a damping term to all pseudo energies as follows: 

pE^avj = A-y 

In a preferred embodiment, the frequency parameter is applied such that the strength of the taboo 
energy increase is dependent on the number of times a given amino acid has occurred at a particular 
position. For example, the pseudo energy equation may be modified to include a frequency bias as 
follows: 

$$ pE^{k+1}_{aa,r,i} = pE^{k}_{aa,r,i} + f_{aa,r,i} \cdot \delta E_{taboo} $$

In other words, applied in this manner the strength of the taboo energy increase depends on the 
frequency of occurrence (f_{aa,r,i}) of that amino acid or rotamer in previous solutions. In a preferred 
embodiment, the frequency parameter is biased against the most frequent amino acid residue at a 
particular position. Any or all of these methods involving recency and frequency parameters may be 
used reiteratively or combined in any order. 

Thus, taboo analysis can be done to generate sequences that are not the GMEC but are local minima 
(low energy) as well. As for all the computational methods outlined herein, this may be done at any 
point during the analysis. Thus, for example, a taboo analysis may be done to identify one or more 
starting scaffolds, e.g. even before a primary library is generated. Alternatively, taboo analysis can be 
used as the computational analysis for the primary and/or secondary library. Alternatively, taboo analysis can be 
applied in combination with other computational techniques as part of either the primary or secondary 
library generation. For example, taboo constraints may be added to a Monte Carlo search. 

SELECTING SEQUENCES FOR THE PRIMARY LIBRARY 

In general, some subset of all possible sequences is used as the primary library. However, in some 






5Y      Y 69.2%    W 17.0%    H 7.3%     F 6.0% 
110T    D 99.9% 
115A    D 46.1%    S 27.8%    T 17.1%    A 7.9% 
118E    I 47.6%    D 43.0%    E 3.6%     V 2.5%     A 1.4% 
125F    Y 51.1%    F 43.3%    L 3.4%     M 2.0% 
129W    L 63.2%    M 28.1%    E 7.5% 
168V    D 98.2% 
170A    T 92.3%    A 6.9% 






These calculations show that PDA™ technology may significantly reduce the dimensionality of the 
problem and may bring it into the scope of gene shuffling and screening techniques (see Figure 3). 

Example 3 

Protocol for TNFα Library Expression and Purification 

Overnight culture preparation: 

Competent Tuner(DE3)pLysS cells in 96-well PCR plates were transformed with 1 µl of TNFα library 
DNAs and spread on LB agar plates with 34 µg/ml chloramphenicol and 100 µg/ml ampicillin. After 
overnight growth at 37°C, a colony was picked from each plate into 1.5 ml of CG media with 34 µg/ml 
chloramphenicol and 100 µg/ml ampicillin kept in a 96 deep-well block. The block was shaken at 250 
rpm at 37°C overnight. 

Expression: 

Colonies were picked from the plate into 5 ml CG media (34 µg/ml chloramphenicol and 100 µg/ml 
ampicillin) in a 24-well block and grown at 37°C at 250 rpm until an OD600 of 0.6 was reached, at which 
time IPTG was added to each well to a concentration of 1 mM. The culture was grown for 4 additional hours. 

Lysis: 

The 24-well block was centrifuged at 3000 rpm for 10 minutes. The pellets were resuspended in 700 
µl of lysis buffer (50 mM NaH2PO4, 300 mM NaCl, 10 mM imidazole). After freezing at -80°C for 20 
minutes and thawing at 37°C twice, MgCl2 was added to 10 mM, and DNase I to 75 ng/ml. The 
mixture was incubated at 37°C for 30 minutes. 

Ni2+-NTA column purification: 

Purification was carried out following the Qiagen Ni NTA spin column purification protocol for native 
conditions. The purified protein was dialyzed against 1X PBS for 1 hour at 4°C four times. Dialyzed 
protein was filter sterilized using a Millipore MultiScreen GV filter plate to allow the addition of protein to 
the sterile mammalian cell culture assay later on. 

Quantification: 

Purified protein was quantified by SDS PAGE, followed by Coomassie stain, and by Kodak digital 
image densitometry. 

TNFα Activity Assay: 

The activity of variant TNFα protein samples was tested using the Vybrant Assay Kit and the Caspase Assay 
Kit. Sytox Green nucleic acid stain is used to detect TNF-induced cell permeability in an Actinomycin-D 
sensitized cell line. Upon binding to cellular nucleic acids, the stain exhibits a large fluorescence 
enhancement, which is then measured. This stain is excluded from live cells but penetrates cells with 
compromised membranes. 

The Caspase assay is a fluorimetric assay, which may differentiate between apoptosis and necrosis in the 
cells. This kit measures the caspase activity triggered during apoptosis of the cells. 

WEHI cells (Var-13 Cell Line from ATCC) were plated at 2.5 x 10^ cells/mL, 24 hrs prior to the assay 
(100 µL/well for the Sytox assay and 50 µL/well for the Caspase assay). 



Table 6. Activity Assay Results for Sytox vs. Caspase 
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Neat**: Normalized to 500 ng/mL 

Table 7. Activity Assay Results for Sytox 
:HI TNF-a Activity Assay(%); 
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Conc.(ng/ul) 
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Table 8. Activity Assay Results Sytox vs. Caspase 
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Neat**: Normalized to 500 ng/mL 



Figure 



Sytox vs. Caspase: % Activity Based on Highest Std. Pt. 

Data Analysis: Fluorescence vs. TNFα standard concentration was plotted to make a standard curve. 
The fluorescence obtained from the highest point on the standard curve (5 ng/mL) was compared to the 
fluorescence obtained from the unknown samples to determine the % activity of the samples. The 
data may be analyzed using a four-parameter fit program to determine the 50% effective 
concentration for TNF (EC50). % Activity of unknown samples = (fluorescence of unknown sample / 
fluorescence of 5 ng/mL standard point) x 100. 
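
The % activity calculation described in the data analysis can be sketched as follows. This is a minimal illustration only, not the analysis program used for the assay: the function names and the fluorescence readings are hypothetical, and the four-parameter curve is written in one common textbook form, which is assumed here rather than taken from the text.

```python
def percent_activity(sample_fluor, top_standard_fluor):
    """% activity of a sample relative to the highest standard point (5 ng/mL)."""
    if top_standard_fluor <= 0:
        raise ValueError("standard fluorescence must be positive")
    return sample_fluor / top_standard_fluor * 100.0


def four_parameter_logistic(conc, bottom, top, ec50, hill):
    """One common 4-parameter curve; the response is midway between
    bottom and top when conc equals ec50 (conc must be positive)."""
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill)


# Hypothetical plate readings in arbitrary fluorescence units.
top_standard = 2000.0
samples = {"variant_A": 1500.0, "variant_B": 400.0}

for name, reading in samples.items():
    print(name, round(percent_activity(reading, top_standard), 1))

# Midpoint check for an assumed curve with EC50 = 0.5 ng/mL.
print(four_parameter_logistic(0.5, bottom=100.0, top=2000.0, ec50=0.5, hill=1.2))
```

In practice the four curve parameters would be fitted to the standard points by a curve-fitting program; only the evaluation of the curve is shown here.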
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Example 4 

PDA Technology Calculations for soluble TNF-R (p55) 



Using publicly available three-dimensional protein structures for the p55 TNFR (Protein Data Bank 
codes 1ext, 1ncf, 1tnr), both alone and complexed with its ligand, PDA™ technology was used to 
design optimized soluble p55 receptors as TNF-α antagonists. For the library shown below, the 
sequences shown were generated using PDA™ technology relative to the Protein Data Bank 1ext 
numbering scheme. Amino acid residues known from the structure of the receptor-TNFα complex to 
be critical for p55 binding to TNFα were designed around. The results shown in Table 1 are an 
example of a library in which 15 positions from the wild-type p55 receptor were used for PDA™ 
technology design. Four of the positions chosen were nonpolar, 7 of the positions were charged, and 
4 were polar. The library shown in Table 1 was pooled from five independent designs, and a 15% 
cutoff was applied for each position in the library. The size of the library for a single mutation is 78 
and the entire library is 1.5 x 10" sequences. The wild-type (WT) sequence is shown in the first line 
of the table. The mutation pattern for soluble p55 receptors at a given position is shown in the 
remainder of the table. 
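
How a per-position cutoff turns pooled design output into a combinatorial library can be sketched as follows. This is an illustration under stated assumptions, not the PDA™ procedure itself: it assumes the cutoff keeps every amino acid whose frequency at a position is at least 15% of the pooled sequences, and that the full library size is the product of the number of amino acids kept at each position. The function names and the three-residue toy sequences are hypothetical.

```python
from collections import Counter
from math import prod


def position_choices(sequences, cutoff=0.15):
    """For each position, keep amino acids whose pooled frequency >= cutoff."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    choices = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        kept = sorted(aa for aa, n in counts.items()
                      if n / len(sequences) >= cutoff)
        choices.append(kept)
    return choices


def library_size(choices):
    """Number of sequences obtained by combining the kept amino acids."""
    return prod(len(kept) for kept in choices)


# Hypothetical pooled output of several design runs (3 variable positions).
pooled = ["ADK", "ADR", "AEK", "SEK", "ADK",
          "ADK", "AEK", "ADR", "AEK", "ADK"]

choices = position_choices(pooled)
print(choices)                # S at position 1 (10%) falls below the cutoff
print(library_size(choices))  # 1 * 2 * 2 combinations
```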



Table 9. 
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EXAMPLE 5 

Computational Stabilization of Human Growth Hormone (hGH) 

Human Growth Hormone (hGH) was computationally redesigned to improve its thermostability. The 
computational design was performed using a previously developed combinatorial optimization 
algorithm based on the dead-end elimination theorem. The algorithm uses an empirical free energy 
function for scoring designed sequences. This function was augmented with a term that accounts 
for the loss of backbone and side chain conformational entropy. The weighting factors for this term, 
the electrostatic interaction term, and the polar hydrogen burial term were optimized by minimizing the 
number of mutations designed by the algorithm relative to wild-type. Forty-five residues in the core of 
the protein were selected for optimization with the modified potential function. The proteins designed 
using the developed scoring function contained six to ten mutations, showed enhancement in the 
melting temperature of up to 16°C, and were biologically active in cell proliferation studies. (See 
Filikov, et al. Computational stabilization of human growth hormone. Protein Science (2002), 11: 
1452-1461, Cold Spring Harbor Laboratory Press, hereby expressly incorporated by reference in its 
entirety.) 

EXAMPLE 6 

DEVELOPMENT OF A CYTOKINE ANALOG WITH ENHANCED STABILITY USING 
COMPUTATIONAL ULTRAHIGH THROUGHPUT SCREENING 

An ultrahigh throughput computational screening method was used to improve the physico-chemical 
characteristics of Granulocyte-Colony Stimulating Factor (G-CSF). Residues in the buried core were 
selected for optimization to minimize changes to the surface, thereby maintaining the active site and 
limiting the designed protein's potential for antigenicity. Using a structure that was homology modeled 
from bovine G-CSF, core designs of 25-34 residues were completed, corresponding to 10^^ - 10^ 
sequences screened. The optimal sequence from each design was selected for biophysical 
characterization and experimental testing, each having 10-14 mutations. The designed proteins 
showed enhanced thermal stabilities of up to 13 °C, displayed 6- to 10-fold improvements in shelf life, 
and were biologically active in cell proliferation assays and in a neutropenic mouse model. 
(See Luo, et al. Development of a cytokine analog with enhanced stability using computational 
ultrahigh throughput screening. Protein Science (2002), 11: 1218-1226, Cold Spring Harbor 
Laboratory Press, hereby expressly incorporated by reference in its entirety.) 
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WE CLAIM: 

1. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

a) receiving a scaffold protein structure with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) establishing a group of potential rotamers for each of said variable residue positions, and 
wherein a first group for a first variable residue position has a first set of rotamers from at least 
two different amino acid side chains, and wherein a second group for a second variable residue 
position has a second set of rotamers from at least two different amino acid side chains; and, 

d) analyzing the interaction of each of said rotamers in each group with all or part of the 
remainder of said protein to generate a set of optimized protein sequences. 

2. A method according to claim 1 wherein said first and second sets of rotamers are different. 

3. A method according to claim 1 wherein said first and second sets of rotamers are the same. 

4. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) establishing a group of potential amino acids for each of said variable residue positions, 
wherein a first group for a first variable residue position has a first set of at least two amino 
acid side chains, and wherein a second group for a second variable residue position has a 
second set of at least two different amino acid side chains; and, 

d) analyzing the interaction of each of said amino acids with all or part of the remainder of 
said protein to generate a set of optimized protein sequences. 

5. A method according to claims 1-4 wherein after step d) a library of said optimized protein 
sequences is generated. 

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6. A method according to claim 5 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 

7. A method for generating a secondary library of scaffold protein variants comprising: 

a) providing a primary library comprising a filtered set of scaffold protein primary variant 
sequences; 

b) generating a list of primary variant positions in said primary library; 

c) combining a plurality of said primary variant positions to generate a secondary library of 
secondary sequences. 

8. A method for generating a secondary library of scaffold protein variants comprising: 

a) providing a primary library comprising a filtered set of scaffold protein primary variant 
sequences; 

b) generating a probability distribution of amino acid residues in a plurality of variant positions; 

c) combining a plurality of said amino acid residues to generate a secondary library of 
secondary sequences. 

9. A method according to claim 7 further comprising synthesizing a plurality of said secondary 
sequences. 

10. A method according to claim 8 wherein said synthesizing is done by multiple PCR with pooled 
oligonucleotides. 

11. A method according to claim 10 wherein said pooled oligonucleotides are added in equimolar 
amounts. 

12. A method according to claim 10 wherein said pooled oligonucleotides are added in amounts 
that correspond to the frequency of the mutation. 

13. A composition comprising a plurality of secondary variant proteins comprising a subset of said 
secondary library. 

14. A composition comprising a plurality of nucleic acids encoding a plurality of secondary variant 
proteins comprising a subset of said secondary library. 

15. A method for generating a secondary library of scaffold protein variants comprising: 

a) providing a first library rank-ordered list of scaffold protein primary variants; 

b) generating a probability distribution of amino acid residues in a plurality of variant positions; 

c) synthesizing a plurality of scaffold protein secondary variants comprising a plurality of said 
amino acid residues to form a secondary library; 

wherein at least one of said secondary variants is different from said primary variants. 

16. A computational method comprising: 

a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) providing a sequence alignment of a plurality of related proteins; 

d) generating frequencies of occurrence for individual amino acids in at least a plurality of 
positions with said alignments; 

e) creating a pseudo-energy scoring function using said frequencies; 

f) using said pseudo-energy scoring function and at least one additional 
scoring function to generate a set of optimized protein sequences. 

17. A method according to claim 16 wherein said frequencies are weighted. 

18. A method according to claim 17 wherein said frequencies are weighted using a diversity 
weighting function. 

19. A method according to claim 17 wherein said frequencies are weighted using a sequence 
homology weighting function. 

20. A method according to claim 17 wherein said frequencies are weighted using a structural 
homology weighting function. 

21. A method according to claim 17 wherein said frequencies are weighted using a weighting function 
based on physical properties. 

22. A method according to claim 17 wherein said frequencies are weighted using a functional-based 
weighting function. 
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23. A method according to claims 19 or 20 wherein if said homology is high, said weighting is high. 

24. A method according to claim 19 or 20 wherein if said homology is high, said weight is low. 

25. A method according to claim 17 wherein said multiple sequence alignment comprises proteins 
with related three-dimensional structures. 

26. A method according to claim 16 wherein pseudo-energy is based on logarithms of said 
frequencies. 

27. A method according to claim 26 wherein said pseudo energy scoring function is based on log- 
odds ratios. 



28. A method according to claims 16 - 27 wherein after step f) a library of said optimized protein 
sequences is generated. 

29. A method according to claim 28 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 



30. A computational method comprising: 

a) receiving a scaffold protein with residue 
positions; 

b) selecting a collection of variable residue 
positions from said residue positions; 

c) providing a sequence alignment of a plurality 
of related proteins; 

d) generating a frequency of occurrence for individual amino acids in at least a plurality of 
positions with said proteins; 

e) selecting a group of potential amino acids for each of said variable residue positions, wherein 
a first group for a first variable residue position has a first set of at least two amino acid side 
chains, and wherein a second group for a second variable residue position has a second set of 
at least two different amino acid side chains according to their frequency of occurrence; and, 

f) analyzing the interaction of each of said amino acids at each variable residue position with all 
or part of the remainder of said protein using at least one scoring function to generate a set of 
optimized protein sequences. 

31. A computational method according to claim 28, wherein amino acids with a frequency of 
occurrence of at least 1% are selected. 

32. A computational method according to claim 28, wherein amino acids with a frequency of 
occurrence of at least 5% are selected. 

33. A computational method according to claim 28, wherein amino acids with a frequency of 
occurrence of at least 10% are selected. 

34. A computational method according to claim 28, wherein amino acids with a frequency of 
occurrence of at least 20% are selected. 

35. A method according to claim 28 wherein said frequency is weighted. 

36. A method according to claim 28 wherein said frequencies are weighted using a diversity 
weighting function. 

37. A method according to claim 28 wherein said frequencies are weighted using a sequence 
homology weighting function. 

38. A method according to claim 28 wherein said frequencies are weighted using a structural 
homology weighting function. 

39. A method according to claim 28 wherein said frequencies are weighted using a weighting function 
based on physical properties. 

40. A method according to claim 28 wherein said frequencies are weighted using a functional-based 
weighting function. 

41. A method according to claims 37 or 38 wherein if said homology is high, said weighting is high. 

42. A method according to claims 37 or 38 wherein if said homology is high, said weight is low. 

43. A method according to claim 28 wherein said multiple sequence alignment comprises proteins 
with related three-dimensional structures. 

44. A computational method according to claims 16-43 wherein said analyzing step further comprises at 
least two scoring functions. 

45. A method according to claim 28 wherein said scoring function is selected from the group 
consisting of a van der Waals potential scoring function, a hydrogen bond potential scoring function, an 
atomic solvation scoring function, an electrostatic scoring function, a secondary structure propensity 
scoring function and a pseudo-energy scoring function. 

46. A method according to claims 28-45 wherein after step f) a library of said optimized protein 
sequences is generated. 

47. A method according to claim 28 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 

48. A computational method comprising: 

a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) providing an amino acid substitution matrix; 

d) creating a pseudo-energy scoring function using said matrix; 

e) using said pseudo-energy scoring function and at least one additional scoring function to 
generate a set of optimized protein sequences. 

49. A computational method according to claim 48 wherein said substitution matrix is selected from 
the group consisting of PAM, BLOSUM, and DAYHOFF. 

50. A method according to claims 46-49 wherein after step e) a library of said optimized protein 
sequences is generated. 

51. A method according to claim 50 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 

52. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

a) receiving a scaffold protein with residue 
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positions; 

b) selecting a collection of at least one variable residue position from said residue positions; 

c) importing a set of coordinates for a scaffold protein, said scaffold protein 
comprising amino acid positions; 

d) analyzing the interaction of each of said amino acids with all or part of the 
remainder of said protein; 

e) utilizing a plurality of scoring functions, at least a first scoring function having a first weight 
and a second scoring function having a second weight, to generate at least one variable decoy 
sequence; and, 

f) comparing the scores from said scoring functions of said variable decoy sequence to the 
scores of a reference state to generate modified weights, wherein each weight is increased if the 
corresponding score of the decoy is higher than the corresponding score of the reference state 
and each weight is decreased if the corresponding score of the decoy is lower than the 
corresponding score of the reference state and, wherein the extent of increase or decrease is 
based on the relative individual and total scores of the decoy and reference states. 

53. A method according to claim 52 comprising repeating steps a) and e) at least one or 
more times to generate a final modified weight for each scoring function. 

54. A method according to claim 52 wherein the collection of variable residue positions is modified 
repeating steps a) through f). 

55. A method according to claim 52 wherein said final modified weight for each scoring function is 
used to generate a set of optimized protein sequences. 

56. A method according to claim 52, wherein the reference state is based on the native sequence 
and structure. 

57. A method according to claim 52, wherein the reference state is a prototypical protein. 

58. A method according to claim 52, wherein said prototypical protein is derived from a set of proteins 
with similar physical or functional properties. 

59. A method according to claim 52, wherein said weights are optimized on a set of said scaffold 
proteins. 
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60. A method according to claim 52. wherein the extent of increase or decrease of said weights 
is based on the total Boltzmann probabilities of said reference and decoy states. 






61. A method according to claim 52, wherein the extent of increase or decrease of said weights is 
based on the difference between individual scores of said decoy and reference states. 

62. A method according to claim 52 comprising replacing at least a single amino acid in said scaffold 
protein to create a variable sequence and analyzing said variable sequence using said scoring 
functions. 

63. A method according to claim 52 further comprising replacing a subset of amino acids, said subset 
selected from the group comprising core, boundary, and surface amino acids. 

64. A method according to claim 52 further comprising protein design automation. 

65. A method according to claim 52 further comprising a sequence prediction algorithm. 

66. A method according to claims 52-65 wherein after step d) a library of said optimized protein 
sequences is generated. 

67. A method according to claim 66 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 



68. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) importing a set of coordinates for a scaffold protein, said scaffold protein comprising amino 
acid positions; 

d) generating a variable protein sequence comprising a defined energy state for each amino 
acid position; 

e) applying an energy increase to at least one of said defined energy states for at least one of 
said amino acid positions; and, 

f) generating at least one alternate variable protein sequence. 

69. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 
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a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions from said residue positions; 

c) importing a set of coordinates for a scaffold protein, said scaffold protein comprising amino 
acid positions; 

d) generating a variable protein sequence comprising a defined energy state for each amino 
acid position; 

e) applying a probability parameter to at least one of said amino acid positions; and 

f) generating at least one alternate variable protein sequence. 

70. A method according to claim 68 or 69 wherein said energy increase is applied to a plurality of 
amino acid positions. 

71. A method according to claim 68 or 69 generating a plurality of alternate optimized variable protein 
sequences. 

72. A method according to claim 68 further comprising applying a recency parameter with said 
energy increase. 

73. A method according to claims 68-72 further comprising comparing said alternate optimized 
variable protein sequences. 

74. A method according to claim 68 further comprising applying a frequency parameter with said 
energy increase. 

75. A method according to claim 74 wherein said method comprises biasing said frequency 
parameter against the most frequent amino acid residue at a particular position. 

76. A method according to claim 68 wherein said energy increase includes the energy increase of a 
set of rotamers for at least one amino acid position. 

77. A method according to claim 68 wherein said energy increase includes the energy increase of a 
set of rotamers for a plurality of amino acid positions. 

78. A method according to claim 68 wherein said protein design cycle comprises applying protein 
design automation technology. 

79. A method according to claim 68 wherein said protein design cycle comprises applying the 
sequence prediction algorithm. 

80. A method according to claim 68 wherein said protein design cycle comprises applying a force field 
calculation. 

81. A method according to claims 68-80 wherein after step f) a library of said optimized protein 
sequences is generated. 



82. A method according to claim 81 further comprising physically generating at least one member of 
said set of optimized protein sequences and experimentally testing said sequences for a desired 
function. 



83. A method executed by a computer under the control of a program, said computer including a 
memory for storing said program, said method comprising the steps of: 

a) receiving a scaffold protein with residue positions; 

b) selecting a collection of variable residue positions 
from said residue positions; 

c) importing a set of coordinates for a scaffold protein, 
said scaffold protein comprising amino acid positions; 

d) generating a set of optimized variant protein sequences 
comprising one or more variant amino acids; and, 

e) applying a clustering algorithm to cluster said set into 
a plurality of subsets. 

84. A method according to claim 83 further comprising applying a taboo search. 

85. A method according to claim 83 wherein said clustering algorithm comprises a single-linkage 
clustering algorithm. 

86. A method according to claim 83 wherein said clustering algorithm comprises a complete linkage 
clustering algorithm. 

87. A method according to claim 83 wherein said clustering algorithm comprises an average 
linkage clustering algorithm. 

88. A method according to claim 83 wherein said subsets are clustered according to sequence 
similarity. 

89. A method according to claim 83 wherein said subsets are clustered according to energetic 
similarity. 
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90. A method according to claim 83 wherein DNA shuffling is applied with said subsets to generate a 
library of optimized protein sequences. 

91. A method according to claim 83 wherein said protein design cycle comprises protein design 
automation technology. 

92. A method according to claim 83 wherein said protein design cycle comprises the sequence 
prediction algorithm. 

93. A method according to claim 83 wherein said protein design cycle comprises a force field 
calculation. 

94. A method according to claims 83 or 90 wherein said subsets are used to generate secondary 
libraries comprising related sequences. 

95. A method according to claims 83-90 wherein after step e) a library of said optimized protein 
sequences is generated. 

96. A method according to claims 94 or 95 further comprising physically generating at least one 
member of said set of optimized protein sequences and experimentally testing said sequences for a 
desired function. 

97. A method for identifying proteins that have a similar conformation to a target protein, said method 
comprising: 

a) receiving at least one scaffold protein structure with variable residue positions of a target 
protein; 

b) computationally generating a set of primary variant amino acid sequences that 
adopt a conformation similar to the conformation of said target protein; and, 

c) identifying at least one protein sequence that is similar to at least one member of said set of 
primary variants, but is dissimilar to said target protein amino acid sequence. 

98. A method according to claim 97, further comprising the step of confirming that said protein will 
adopt said conformation of said target protein. 

99. A method according to claim 97 wherein an amino acid sequence with less than 30% sequence 
identity is dissimilar. 
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100. A method according to claim 97 wherein an amino acid sequence with less than 20% sequence 
identity is dissimilar. 

101. A method according to claim 97 wherein a similar conformation is a protein comprising a 
position for a given fold. 



102. A method according to claim 97 wherein said computationally generating is applying a protein 
design algorithm. 

103. A method according to claim 102 wherein said computationally generating is applying protein 
design automation. 

104. A method according to claims 97 or 102 wherein said computationally generating step 
comprises a taboo search. 

105. A method according to claim 102 wherein said computationally generating step comprises 
applying a sequence prediction algorithm. 

106. A method according to claims 97 or 102 wherein said computationally generated sequences are 
used to create a Position Specific Scoring Matrix. 

107. A method according to claim 102 wherein said computationally generating includes the use of at 
least two scoring functions. 

108. A method according to claim 107 wherein said scoring functions are selected from the group 
consisting of a van der Waals potential scoring function, a hydrogen bond potential scoring function, 
an atomic solvation scoring function, an electrostatic scoring function and a secondary structure 
propensity scoring function. 

109. A method according to claim 97 wherein the method for identifying said protein comprises 
searching public databases. 

110. A method according to claim 97 wherein the method for identifying said protein comprises using 
a dynamic programming algorithm. 

111. A method according to claim 97 wherein said confirming is selected from the group consisting of 
x-ray crystallography, NMR spectroscopy, and combinations thereof. 
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112. A method for generating variant protein sequence libraries comprising: 

a) providing populations of at least two double stranded donor fragments corresponding to a 
nucleic acid template; 

b) adding polymerase primers capable of hybridizing to end regions of each of said population 
of donor fragments; 

f) generating a population of hybrid double stranded molecules wherein one strand comprises a 
5'-purification tag and the other strand comprises a 5'-phosphorylated overhang; 

g) enriching for variant strands by removing strands comprising a 5'-biotin moiety; 

h) annealing said variant strands to form at least two double stranded ligation substrates; and, 

i) ligating said ligation substrates to form a double stranded ligation product wherein said 
ligation product encodes a variant protein. 

113. A method according to claim 112 wherein one of said polymerase primers generates a variant 
nucleic acid strand. 

114. A method according to claim 112 wherein said template generates a variant nucleic acid strand. 

115. A method according to claim 112 wherein step e) precedes step d). 

116. A method according to claim 112 wherein steps a) through f) are repeated to generate a variant 
protein. 
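
The single-linkage clustering of optimized sequences recited in the claims above can be illustrated with a short sketch. It is an assumption-laden example rather than the clustering used to produce the tables that follow: it measures sequence similarity by Hamming distance (the tables refer to an energy matrix r.m.s.d. instead), cuts the single-linkage hierarchy at a fixed distance threshold, and uses hypothetical function names and toy sequences.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_clusters(sequences, max_distance):
    """Group sequences so that any two clusters are more than max_distance apart.

    Cutting a single-linkage hierarchy at max_distance is equivalent to
    finding connected components of the graph that links every pair of
    sequences within max_distance; a union-find structure does this here.
    """
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if hamming(sequences[i], sequences[j]) <= max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values())


# Hypothetical four-residue sequences; AAAA and AATT differ at two
# positions but are chained into one cluster through AAAT.
toy = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC"]
print(single_linkage_clusters(toy, max_distance=1))
```

Complete-linkage and average-linkage clustering differ only in how the distance between two clusters is defined (largest or mean pairwise distance, respectively), so they would not chain AAAA and AATT in the same way.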
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Cluster 
quence (Energy 
# Matrix Sequence 

I r,m,s,d) I 

14 0 SGDNGWTDKESADSRYKFYFNDQTNESTSSMPTN 

18 0 SGPNGWMMMMLDSSNYWYYYDLLTNLSTSSEPAS 

54 0 SLPAGWTDMELDSSl^KiryFNMLYQQSTETMPTS 

66 0 SGDNGWTLMMLDSSNVMYYYNKETQLSTMSQPTS 

138 0 SLPNGWMLMELNSARYMYYY2S1MLYNLSTNSMPTN 

156 0 SGPNGWMLMEMDSSSYMFYYKMETQESTWSRPTS 



112 1 SGPAGWTWEMLQNADFWYTFNKIYEQMLEQMPTS 

119 1 NGPSGWTWKMMDSSNVWYTFNDSTNTSTMSMPTN 

127 1 SGPSGV\nyiM^QyiLDNSQFYYTFNNETN 

129 1 SGPDGWTWKMMNDANYWYTFNMATQTATASl^ 

137 1 NGPSGWTWKMMDSADFWYTFNDATNTMLEEMPTS 

141 1 SGPPGWTVTOmPSADFWTYI^MATNEMTESMPND 

157 1 SGPSGWAWKMLDNAQYWYTFNDLYQQSTSSMPND 

169 1 SGPSGWMWMMAQSANFWFTYNLLTNTSLEQMPTS 

176 1 SGPDGWTWimiDSANFWYTFNKETQLALEQMPTN 

182 1 SGPDG\AnyiWKMMDSADFWFTFNDQTO^ 

184 1 NGPDGWAWMMLQNAISrmYTYNNETNESTQEM 

186 1 SGPDGWTWKJ^DNAQFWYTFISIMATNTAQESMPND 

189 1 SGPDGWLWKMLQNANYWYTFNDLTQLATESMPTN 

191 1 NGPSGWNV^MyiMDNAQFW^ 

193 1 SGPPGWTWMMMNDANVWFTFNKLTEEATMSMPTN 

196 1 SGPNGWTWMMLDSANYWYTYNMMYNLATASM^ 

199 1 AGPSGWTWMMLDSAISrmFTYIsIMATNTATE'^ 

203 1 SGPDGWTV!M>lMQSANYWyMFNMATQETTASMPTS 

208 1 SGPDGWAWMMLDNAQFWFMFNNETNTAQESMPTD 

211 1 NGPDGWMWMKLDNADYWYTFNLLTNTATETMPTN 

213 1 SGPSGWTWMmQSANFWYTFNMATNTMLEQMPTS 

215 1 NGPPGWTWMMLQSADYWYTFNMMSNTATASMPTS 

218 1 SGPSGWNl>Ma>QSANYWFTFNTLYQQAQQEMPND 

220 1 NGPSGWTWPMyrPSAQFWYTFNDATDTMQEQMPTS 

222 1 SGPDGWTV^LDSADYWFTFNKLTNTAQASMPND 

224 1 NGPDGWNWMMMDNANFWYTFNLATNTAQEEMPTS 

228 1 SGPSGWAWMMLDNANFWFTFNNETNTSQSEMPND 

235 1 SGPNGWiyiWMI^QNANFWYTFNN^ 

238 1 SGPDGWTWMMLDNAQFWFTFMMETNTAQESMPND 

240 1 SGPDGWMWKMMDNANYWYTFNNETNTMQAEMPTS 

242 1 SGPDGlffMWMMAQSADFWYTFNLATNTSQEQMPTS 

245 1 NGPDGWTWMMLDNANFWFTFNMATNTSQESMPTS 

248 1 SGPSGWTWMMMQSANFWYTFNMATQESTTSMPTS 

253 1 SGPDGWTWMMADNADYWFTFNKLTNTAQAEMPTS 

257 1 SGPPGVMWMMMQSANFWYTFNMATNESTEQMI^ 

259 1 NGPSGWTWMMLQSANYWYTFNMLYQQMQAEMPTN 

262 1 SGPDGWTV^/MMLDNADFWFTFNKLTNTAQESMPND 

269 1 SGPPGWTWKMMDSANYWYTFNDLTNTAQASMPND 
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Sequence 
# Cluster (Energy Matrix r.m.s.d) Sequence 

271 1 NGPSGWMWMMLDNANFWFTFNLATNTAQEQMPTS 

277 1 SGPDGWTWMMMQSANFV\^^TF^^ 

279 1 SGPSGWTlAnyMLQ SANFWYTFNMATNTSQESMPiro 

280 1 NGPDGWMWKMMDNADFWYTFNNETNTAQESMPTS 

281 1 SGPDGWTV^M^QSAISrraYTFNML™ 

283 1 SGPDGWMWKMMDNANFWYTFNDATNTAQESMPND 

285 1 SGPDGWnyiWMMLQNAim^TFNLA™ 

290 1 SGPDGWMWmLQNANFWYTFNLLTNTAQESMPNT) 

299 1 SGPDGV#!\AnynyiLQNANFWYTFNLLTNTAQESMPTS 

300 1 SGPDGVmMMLQNANFWYTFNMLTNTAQESMPTS 








4 2 SGPDGWTWMMANSSQLQYTFNKETNTSTMENPTN 

8 2 NGPDGWMWMKLDSSSEEYTYlsIMATNESTKEEPTS 

10 2 AGPNGWTWLEANSSQYKFTYDMLYQQSTSSEPTS 

31 2 SLPSGWMV\nyn^SDAQLEYTY2SI^^^ 

34 2 SGPNGWTWMIOSINSSEYWYTYDNETQQSTESNPTN 

44 2 NGPDGWTWEMDQSSNRYYTFDKATEQATDTEPTS 

51 2 AGPPGWTWLMLDNSDVLYTYNLETQTATMEEPTS 

56 2 SGPDGWTWAKADSSELLYTYlSnsrETQEATMSEPAS 

77 2 SGPSGWTWMESNDSQTLYMFNNETQQTTMTMPTS 

79 2 NLPSGWSWLKMDSASYEYTYNMSSNTATDSEPTS 

85 2 SGPSGWTWEltfimSQDMYMFNKSTMTSTKSMPTS 

88 2 SGPAGWTMAMLQNADLRFAFNDATNTTTESDPAN 








1 3 NGPDGWTWMMWQNASYEYTFNMLTNTAQWSMPTS 

121 3 SGPPGWTWEMWQNADYWYTFNKMTEFWLWEMPTS 

135 3 SGPSGWAWiyny[WQSANYWYTYl^ETNTATWSMPND 

139 3 SGPDGWTWEMWQNAQYYYTFNKMTNTMTWTMPTS 

143 3 SGPSGWTWiyn^QSANYWYTFNKIYEQMTWTMPTS 

155 3 NGPSGWTWMMWQSANYWYTFNMATNTAQWEMPTS 

159 3 AGPSGWNWMMWQSANYYYVFNTIYQV^ 

161 3 SGPDGWTWMMMDSAQYWYTFNMLTDT^ 

163 3 NGPAGWLWMMWQNANLWYTFNLATNTWTO 

167 3 NGPDGWTWMKWNSADYYYTFNKATNTMTWSMPTS 

180 3 SGPPGWTWMMWDNANLWYTFNKATNTMTWTMPTS 

197 3 NGPSGWMWKMWEN7U3YYYTFNDLTNTMLWEEPTS 

201 3 SGPSGWTOMMWDSADYWYTFNKMYEQMQWSMP 

206 3 SGPDGWTWKMVroSANYWYTFlTOATNTWTWTMPT^ 

230 3 SGPNGWMWMiynAroNAQYWYTFN^ 

232 3 SGPSGWMWMMIATONANYWYTFNLETNTMTWSMPND 

246 3 SGPSGWNWMMWQSASYWYMFNTLYQQMTWTMPTS 

250 3 NGPNGWMWMMWDSAimTYTFNKDT^ 

255 3 SGPNGWTWMMWQSANYWYTFNMDTQQMTWSMPTS 

263 3 NGPSGWTWl^lMWDSANYWYTFNMATNTMTWSMP 

265 3 SGPDGWTWEMWQSANYWYTFNKATNTMTWTMPTS 



285 



294 



SG PIX31^m7MMWDNAQYV\r^TFl^LTO 



SGPDGWTWMMWDNAlSnn/JYTFNMATNTMQWSMPTS 



298 



NGPDGWMWMMWDNANYWYTFNLLTNTMQWSMPTS 



93 



151 



187 



NGPDGWTWMMMQNASVEYTFNMATNTATMSEPTS 



NGPPGWTWEMWQSASYEYTFNKMTEQMLWEEPTS 



16 



NGPIXjWTMEMMQNSQFMYAFNKMYEQTQSSDPTS 



21 



NGPPGWTMLMMQSAQVMYAYDMTTQQSTMSMPTN 



32 



SGPDGWLMMELDNSREMYAFNLLTQQSQWSMPTS 



47 



SGPSGWTMAMayPNADLWYSFNNETMI^ 



52 



68 



SGPDGWLMMRMQSAETDYAFNDQTNETTDSMPTN 



SGPDGWTMAELSNSQFKYAFNNMSQQSQSAMPTS 



82 



SGPNGVmiAELDNSRVMYAFNNETQQSTMTMPTS 



91 



NGPDGWNMMKMNSAEFMYAFNTMYEQTQEQQPTS 



100 



NGPNGWLMMMADSANYMYAYNDMYNLAQSSMPTS 



SGPNGWMDEMMDSSNVKYYFNKDTQQTTMTMPTS 



SLPSGWTMLMKQNADEVmrYiamSNTSOWSEPAS 



NGPAGWADKEMADARVKFYFNDQTQLATMTMPTN 



23 



28 



SGPSGWADMELSSANYKYYFNNETNTTTDQMPTS 



SGPDGOTDLMLDNSEYRYYYNMLYNWSOSSMPTS 



30 



NGPNGWTKEMMQSSDYWYYFNKASEOSTASEPAS 



33 



AGPSGWTDAMLDSANFKFYFNDSTMTTTESEPTS 



39 



43 



SGPSGWTDLESENANYKYYYNLETQOSOSSMPAS 



DGPSGWNLMEKSDAREMYYFNNETDTSQWSKPND 



48 



SGPDGWADMMAQSANYRFYYNDMYQQSTSSMPTS 



58 



SGPNGWTDMELDNSRYKYYFNMDTQQTTSSRPTS 



60 



NGPDGWAMMMLQSLNYWYYFNDATNESTWSEPAS 



64 



SGPDGWMNEEMQNSRFKYYFNKDTQQTQESMPTN 



74 



76 



84 



86 



NGPNGWTDLELDNARYKFYYNKATNLSQSSMPTS 



AGPDGWMNEMLQSAEVRYYYNALTNESTMSQPTS 



SGPDGWMDMMLQSANFKYYFNMLTEQTQSAMPND 



AGPNGWTKKKLDSAEYEYYYISDyiMSNESTWSEPTS 



89 



94 



SGPPGWMDMELDSSNVKYYYNLLSNTSTMSMPTN 



SGPEGMMDEMAQSSNYKYYYNKLYNLSQSARPTS 



96 



NGPPGWAMMMIJSySSQFWFYFN]S^TNESTESMPTN 



101 



SLPDGW^^^MEWNSSQYKYYFNKDTQTMTWTMPTN 



105 



NLPPGWTDMEWPDARLKYYFNMMTNTAKWERPTS 



107 



SGPSGWMIJynyiLDSANVMYYYNI^ffi 



109 



DLPSGWLDMEKSNASEKYYFNLDTOOTTWSRPTN 



111 



113 



6_ 
6 



NGPIX3WTLKQMDSSQYiyiYYFNDLTNTM0WAEPTS 



SLPSGWNDMELNSASYKYYFNTAYQLATS SMPT S 



118 



SLPDGWMDMELDNASFKFYYNLLYNWAQESMPTS 
120 6 AGPNGWTDMML ES AEYRYYFOTIDTQETT
122 6 DGPDGWADMELDSSQFKFYYNNETQQAQEAKPND
124 6 NLPNGWTLKMLDNANYMYYFNMATQQSTNSMPTN
128 6 NGPNGWNKMMWNSASYEYYYNMMTQEWTWSEPTS
130 6 SGPEGWMDMELDNSRFKFYFNMLYNLTQSEMPTS
134 6 SGPNGWTDMELDSASYKYYFNMLYQESTSSRPTS
136 6 SGPDGWMDMMMDNSEVRFYFNKDTQQTQMENPTS
140 6 AGPAGWMNMRADSSNYDFYFNNETQQAQSSQPTS
144 6 SGPNGWMIfKKMDSSEVMFYFNNETNLATMSEPTS
148 6 SLPDGWTDMEADNSRYKYYFNKATEQSTSSMPTN
154 6 SGPDGWLDMEASNASYKYYFNLLTQLTTSSMPTS
158 6 NGPDGWTKMKLNSADVDYYFNKATNTTTMTEPTS
160 6 SGPNGWMDKEMDNSRFKFYYNNETNQATESMPTN
162 6 DGPPGWMLMKLNSADYMYYFNMMYEQTQSSKPND
168 6 DGPPGWADMELDNSRYKYYFNNMYQLAQSSKPND
179 6 NGPDGWMDMESSDARYKYYFNKDTQEATASRPTS
181 6 SGPSGWNKMMMQSASVEYYFNTLYEQSQMSEPTS
185 6 AGPSGWMLMKADSADYMYYFNLLYQQTTSSQPTS
188 6 SGPNGWMDMEMDSSSVKFYFNNETNESTMTRPTS
200 6 SGPDGWADMESSNAQYKYYFNLLTQESTSSMPTS
209 6 SGPSGWTDMEANSASYKYYFNMATQTSTASMPTS
223 6 SGPSGWTDAESSDAQYKYYFNNETQESTSSQPTS
225 6 AGPPGWMDMELQSASYKYYFNMMYEQTTSSRPTS
229 6 NGPDGWTDEEMQSARYKYYFNKLTEQSTSSMPTS

3 7 SGPSGWLNMELDSASFMFMYNLATNESTESMPTS
6 7 SGPPGWNWMKKDSSEYLYMFNTLYEQTTASMPTS
12 7 NGPDGWMKM^EMDNSSVEYMYNLATNLSTMTMPTS
19 7 SGPIXjWNNMRANDADYDYMFNNNTQESTASMPISID
20 7 SGPSGWTWKEMDNSRFLYTFNDSSNTAQESNPTN
24 7 SGPDGWTWMKLDNSEFQYMYNKLYNLSQSSMPTS
25 7 NGPSGWMNEMAQSADYWYTMNTATETTTESEPAS
35 7 SGPPGWTNERMQSADVDYMFNKASNEATMTMPND
36 7 NGPDGWNWMMADNSSYEFTFNTLYQLTQAERPTD
37 7 SGPSGWMDMMLPDAQFMYVYNMMTNTSTSSMPTS
40 7 rv!(jjFiAjWAj>jyijyiM iNjijAi yijAijyiiHir i o
46 7 SGPNGWTWMELNSSSFLYTFNLLTQESTTSNPAN
50 7 SGPEGWTNEESNDAQFKYMFNKLYQWSQSSMPND
71 7 SGPSGWTWKKADNADYWYTYNDLYNWAQAENPAN
75 7 SG PPGWLWWyiADS SNFWYTFNDMYMFALEQMPND
80 7 SGPDGWMKMMLQSSNYWFMYNLLTNESTASMPAS
83 7 NGPSGWAWMKANSADYLFTYNDAYNLATESEPTS
92 7 SGPSGWTNMELDNSSYKFMYNKETNLSTASMPTS
98 7 SGPDGWTDEEMQSASYKYMFNKLTEQTTSSMPTS
99 7 NGPSGWTWMKLDNSEFYFTFNMATNQSTSEEPTS
1 lO*^ 


1 7 


bUFbC^WNNMRLySANYDYMYNLLYNTSTA j 


1 106 

1 JL \J \J 


7 


oL?FlMoWAWKJXlbyr\JSQF EYTFNDLYQ^ | 


1 1 1 
1 J. X D 


1 7 
j / 


bCjPWbWAKMKMNSADYDYMFN^ j 


1 1 


1 7 


1 SGPSGwTWMKMDSSDVWFTYNLIYNLATMSMPTS | 


1 1 ^9 


1 7 


j bCjPEGWTNEMlJSlDSQYWYiyiFNKIYE j 


I 1 64. 


1 7 


1 oGPDGWTDIffiMDSSSVKFiyr^fN^ j 


1 JL tD 


1 7 


1 SGPlKjWTDMELNSSSYKFMYNMLTNESTESMPTN 1 


1 17ft 


1 7 


faGPJ^IGW^IWMMADSSQYYFTYNIJMYNQSQSEMPND j 








1 Q 


1 Q 


1 bGPSGWTMEMLQNANFMYAFNKETDTTQESMPND | 


I 1 


1 fi 


bGPPbWNWMMLQSANFWYTFNTIjTQTSQEQNPAN | 




1 P 


1 bGFbCjWTWMKLESAEYLYTYNMESQFMLWEMPTS | 


1 1 7 


1 P 


oGPoGWATOIMMjSASVEFTFNNATNTATMTOPTS 1 


1 *>1 
1 ^ ' 




agpagwtwkmmdsanvyytfndqtneatmsqpts 1 


1 




bGPSGWuNlWMESNSASFLFTFN^ | 


1 41 




AGPPGWMwKMLDSADFWFTFNNLSNTSLEQQPTN | 




P 


sgpngwtdmkwdnseyrymynmmtnestwsmpts 1 


1 4 Q 


P 
o 


ngpsgwtwmkkdsseeyytfnmasntstwsepts 1 


1 3D 


1 p 


ngpsgwnnmmmenaqfwymfntatntttesm^ 1 




p 


bGPPGV\MVMKI©JSASFEFTYNMLTNFALEQMPTS | 


1 61 

1 D X 


Q 1 
O 


i AGPSGwMWMEMDNASVLYTFNLSSMTSQMEQPTS 1 


1 6^5 

i ^ 1 


Q 1 


SGPSGVWWMMLNDAQEYFTFNTLTNTATKENPTN 1 


1 6R 


P 1 


NGPAGWMWKJvWDSADYI^TFI^ 1 


1 6Q 


p 

^ \ 


bGPPGWNWMMKQSANELYTFNLATNTSTWTQPTS 1 


1 79 


p 


bGPDGWAJxIMMLQSASYMYAFNNETQQSTWSMPTS j 


1 ^7 


p 


JNI>aA:^iJV3WWVVMMiy[DSSSEWYT | 


1 QH 


p 


bGPbvjWl WMMKSDAQELYTFNMDTQEATWTMPTS j 


1 

1 >^ 1 


p 


bGPJJGW 1 WMKMDNADVDYTFDMASOTTTMTMPTS | 


1 Q7 

1 ^ ' 1 


p 


bCjFbbWTWIiMLNDASDEYTYNLETQTAQKSMPND j 


1 1 09 


R 1 


bGPAijiWTWKjyiMENADFwY | 


1 1 04 1 


p 1 

o 1 


AGPDGWTWMMMDSSSYWFTFNKATEQSQWSNPND j 


1 1 OP 1 

1 X u o 1 


p 1 


SGPIX3WTWKKMDSSDFWYTFNMATNTAQSEMPTN | 


1 110 1 
1 xxu 1 


p 1 

O 1 


bGPNGwMWMMwNSANYYFTYNK^ | 


i 117 1 


Q 1 

^ 1 


SGPSGwTWEMWQSLQYYYTFNKASNTMTWTQPTS | 


i 1 9*^ 1 
1 J- ^ J 1 


D 1 
o 1 


SGPAGWMWMKMNSASYEYTFNMLSNTSTETMPTS | 


1 


8 


AGPSGWLWMMIjOSAnYWYTYWT.n'rM'rQ'pT?'pnT>»rc 1 


132 8 NGPDGWTNMKWESAEYMYMFNKATEQSTWSMPTN
133 8 SGPPGWMWKMMDNSQFYYTFNDMTNTMQEQNPTN
145 8 SGPrXSWTWMmPNASFEYTFNMATQTSLEQMPTS
147 8 NGPPGWNWMMLOSANYWYTYNMMTNEATASMPTS
149 8 SGPSGWLWMKNNSADFWFTYNLETNTSQEEMPTS
153 8 NGPPGWMWKKMDSADFYFTFNl^TNTS^^
165 8 NGPSGWMWKMKDSAQYYYTFNDSTNTEQAEEPTS
171 8 NGPSGWTWMMMDNASVEYTFNKETNTTTMTMPTS
174 8 NGPAGWNWMMMENADYWYTFNTATEQSTQEEPTS
194 8 NGPNGWAWMKAQNADFYYVFNNETQQTTESMPTS

11 9 SGPSGWLDMRWDSADYNYYFNNMYNTATWARPTD
78 9 SGPNGWNDMMWPNADYKYYFNTATETSQWENPTS
115 9 NGPPGWMMKl^DNANFWYYFNDLTDTEQSERPTS
126 9 SGPDGWTDEEWQSASYKYYFNKLTEQMQWENPAS
142 9 NGPDGWLDMELSNASYKYYFNLLTDTSQSERPTS
146 9 DGPEGWTDAMWDSAQYKYYFNDLYNWWQWSKPND
150 9 AGPNGWMLKMWDSANYMYYFNDLYQLMTWTQPTS
166 9 SGPNGWMMEMMQNANFWYYFNKDTQQSTESMPTN
170 9 SGPDGWTLKMWDSAQYMYYFNMATEQWTWSQPTS
172 9 DGPNGWMNEMLQSANFWYYFNKDYQLAQSSKPND
175 9 SGPSGWMDKEWDSASYKYYFNDSSNTWQWAMPTS
177 9 AGPSGWTLEKWQNADYMYYFNMLTNTSTWSQPTS
183 9 SGPNGWTDMEWNSSQYKYYFNMDTQLAQWSMPND
190 9 SGPAGVraOMRWNSADYDYYYNMASNTWQWSMPND
192 9 SGPDGWMLKELDSASYMYYFNDQTNTSQSENPTS
195 9 SGPDGWMDMEWDSSQYKYYFNMATNTWQWSMPND
198 9 SGPDGWLDMEMQSASFKYYFNNETQQSQESRPTS
202 9 NGPNGWMDKELDNSSFKFYFNNETNTAQSSQPTS
204 9 SGPPGWMLMKLNSADFMYYFJSIMLTNTAQEQMPTN
205 9 AGPNGWMDMELDNAQYKYYFIJKDTQQSTSSQPTS
207 9 NGPSGWTDEMMQSADYKYYFNKLYEQTQSENPTS
210 9 DGPNGWTLEMWQSANYMYYFNKMYEQSQWSQPTS
212 9 SGPNGWADMELDSAQYKYYFNNETQQSQSSRPTS
214 9 DGPIX5WMDKEWDNASYKFYFNLLTELWQWSKPND
216 9 SGPNGWMDEMMDNAQFKYYFNKDTQQSQESMPTS
217 9 SGPDGWMLMEADSARYMYYFNMATNTSTSSQPTS
219 9 SGPDGWMDMEWDNASYKYYFNNETNEMTWSRPTS
221 9 SGPNGOTLMMWQNANYMYYF]m)TQQMTWTM
226 9 SGPNGWTLKMWDNAQYMYYFNDLTNTWQWSMPTN
227 9 SG PDGV^^MMMQSADVKYYFNLATQ^
231 9 SGPDGWTDLELNSASFKFYFNMATQQAQESNPTS
233 9 NGPDGWTLEMMQSADYMYYFNKLYEQSQSSMPTS
234 9 SGPPGWNDMEWDSAQYKYYFNMATNTMTWSRPTS
236 9 AGPDGWTDKEKDSASYKYYFNDLTNTEQSSQPTS
237 9 NGPSGWMNMMMQSANYWYYFNLATQESTSSMPTS
239 9 SGPSGWTDEEWQSASYKYYFNKLYEQMTWSQPTS
241 9 NGPNGWTDMELDSAQYKFYFNMDTQQATSSRPTS
243 9 SG PPGWTLEMWQNAlSn«rMYYFNKLYEQMQWSMPND
244 9 AGPSGWMMKMMDSAQYWYYFNDSTNTATASQPTS
247 9 SGPDGWMDMELDNAQYKYYFNNETNTAQSSRPTS
249 9 SGPDGWLLMKLDNADYMFYFNLLTNTAQSSMPND
251 9 AGPDGWTDKEMDNAQFKYYFNDATNTMQESQPTS
252 9 NGPNGWMMmLQSAlSnrWYYFNNETQESTSSEPTS
254 9 SGPSGWMDKEMDSASFKYYFNDATNTAQESKPND
256 9 NGPDGWLDMELDNAQYKFYFNLLTNTAQSSQPTS
258 9 S G PDG WML KMWDNAD YMWFNNETNTAQWSMPND
260 9 SGPDGWMDKEMDNAQFKYYFNDATNTAQESQPTS
261 9 SGPNGWMNMMWQSANYWYYFNNETQQ]^^
264 9 SGPDGWMDMELDNAQYKYYFNLLTQQSQSSQPTS
266 9 SGPSGWMDMELDNASFKYYFNNMYQQAQESMPTN
267 9 NGPDGWTMMmQSANY^YFNMLTNTSQSSEPTS
268 9 SGPDGWMDMELDNAQFKFYFNLATNTAQESRPTS
270 9 AGPDGWMDMMLQNADYKYYFNNETQQSQSSQPTS
272 9 SGPDGWTDEMWQSAQYKYYFNKLTNTMQWSMPTS
273 9 SGPIX3WTMMMLQNA]SrYWYYFNMATQQSQSSQPTS
274 9 NGPDGWMDKEMDSASFKYYFNDLTNTAQESMPTS
275 9 SGPSGWMDMMLQNAQYKYYFNLLYQQTQSSRPTS
276 9 NGPIXSWMKMMWDNAlSrZWYYFISnsrE™
278 9 AGPIXSWTDMELDNAQYKFYFjmjTlSrTAQSSQPTS
282 9 NGPSGWMDMELDNAQYKYYFNLLYQQSQSSQPTS
284 9 SGPDGWTDMMLQS ADYKYYFIsIMLTNTSQ SSMPTS
287 9 NGPDGWADMMLQSANFKYYFNNETNTAQESQPTS
288 9 SGPSGWNMiynyiM)NAN^^^
289 9 SGPDGWTDMELDNAQYKYYFNMATNTSQSSMPTN
291 9 SGPDGWMDMEWDNASYKYYFNLATNTMQWSQPTS
292 9 NGPTCWTMMMLONANYWYYFNMLTIJTAQSSEPTS
293 9 SGPDGWTDEMLQNAQFKYYFNKLTNTAQESMPTS
295 9 SGPDGWMDMELDNAQYKYYFNLLTNTSQSSQPTS
296 9 NGPlXSWMMMMLONANFWYYFKOyiLTOT
297 9 SGPDGWTDMMLQNANFKYYF^DyIATNTAQESQPTS

26 10 SG PNGWMMMOKADSQEMF S FDNMTQQSTWTMPND
62 10 SGPNGWTDIjyLADSSEDRYMYNMimnJS

22 11 NGPNGWiyn^\/MMWNSLSYEFTFNK^
38 11 SGPNGWTWMKWDSSEYMYTFNMDTNQSTWSEPTS
45 11 SGPAGWMMMMMDSAQYQFAYNKDSNTSTWTMPTS
53 11 SGDNGWMVMMWNSASYEFTYNMMTN
57 11 SGPSGWTMEMWQSADYRYAFNKSSDTSQWADPTN
67 11 AGPSGWTWIJ^VNSASLEFTYNMLYWTATQSEPAS
70 11 NGPNGWMDMKMDSSEYRFMFNMQTMLTTWSMPTN
73 11 SGPSGWMWMMMNSSQLEYTFNRDTlSrrT^
81 11 SGPPGWTWlynyiMDSAQFOYTFNKMTDTSQEENPTN
114 11 AGPDGWTWMKWDS SELEFTYISIMETNESKWEEPTS
These two views of representative structures from clusters 1, 3, and 9 illustrate
the relationship between clusters and couplings of amino acids. In the upper
set, it is clear that clusters 1 and 3 share a coupling pattern that is distinct from
that of cluster 9. However, the bottom set reveals that on the opposite side of
the protein, clusters 1 and 3 have different coupling patterns.
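The relationship between clusters and residue couplings can be explored directly from the tabulated sequences. The following is a minimal sketch only, not the method used to generate the table: it assumes rows of the form "sequence number, cluster, sequence", and it treats the co-occurrence of residues at two positions as a crude stand-in for coupling. The function names and the positions 7 and 18 (0-based) are hypothetical choices made for illustration; the four example rows are intact entries taken from the table above.

```python
from collections import Counter, defaultdict


def group_by_cluster(rows):
    """Group sequences by cluster from 'sequence# cluster sequence' rows."""
    clusters = defaultdict(list)
    for row in rows:
        parts = row.split()
        if len(parts) != 3:
            continue  # skip blank or malformed (e.g. OCR-damaged) rows
        _seq_number, cluster, sequence = parts
        clusters[int(cluster)].append(sequence)
    return clusters


def pair_counts(sequences, i, j):
    """Count the residue pairs seen at 0-based positions i and j."""
    return Counter(
        (seq[i], seq[j]) for seq in sequences if len(seq) > max(i, j)
    )


# Four intact rows from the table: two from cluster 1, two from cluster 9.
rows = [
    "271 1 NGPSGWMWMMLDNANFWFTFNLATNTAQEQMPTS",
    "280 1 NGPDGWMWKMMDNADFWYTFNNETNTAQESMPTS",
    "254 9 SGPSGWMDKEMDSASFKYYFNDATNTAQESKPND",
    "256 9 NGPDGWLDMELDNAQYKFYFNLLTNTAQSSQPTS",
]

clusters = group_by_cluster(rows)
for cluster_id in sorted(clusters):
    # Positions 7 and 18 are illustrative, not taken from the source.
    print(cluster_id, dict(pair_counts(clusters[cluster_id], 7, 18)))
```

In this small sample, both cluster 1 sequences carry the pair W/T at the chosen positions and both cluster 9 sequences carry D/Y, which is the kind of cluster-specific pairing the figure describes. A fuller analysis would need the complete, uncorrupted table and a proper statistical measure of covariation.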
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